In heterogeneous systems that include CPUs and GPUs, the data transfers between these components play a critical role in determining the performance of applications. Software pipelining is a common approach to mitigating the overhead of those transfers. In this paper we investigate advanced software-pipelining optimizations for the double-precision general matrix multiplication (DGEMM) algorithm running on a heterogeneous system that includes ATI GPUs. Our approach decomposes the DGEMM workload at a finer granularity and hides the latency of CPU-GPU data transfers to a higher degree than previous approaches in the literature. We implement our approach in a five-stage software-pipelined DGEMM and analyze its performance on a platform including x86 ...
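The general idea of hiding transfer latency behind computation can be illustrated with a much simpler, two-stage pipeline. The sketch below is not the five-stage design described above and involves no real GPU: `transfer`, `multiply`, and `pipelined_gemm` are hypothetical names, the "transfer" is simulated by a copy on a background thread, and the "kernel" is a plain triple loop. It only shows the ordering that lets the copy of block i+1 proceed while block i is being multiplied.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(block):
    """Stand-in for a host-to-device copy (assumption: simulated by a row-wise copy)."""
    return [row[:] for row in block]


def multiply(a_block, b):
    """Stand-in for the device kernel: product of a row block of A with all of B."""
    cols = list(zip(*b))  # columns of B, so each output entry is a dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a_block]


def pipelined_gemm(a, b, block_rows=2):
    """Compute A @ B by row blocks, overlapping the copy of the next block
    with the multiplication of the current one."""
    blocks = [a[i:i + block_rows] for i in range(0, len(a), block_rows)]
    result = []
    if not blocks:
        return result
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, blocks[0])   # prologue: start the first copy
        for nxt in blocks[1:]:
            current = pending.result()                 # wait for the block in flight
            pending = copier.submit(transfer, nxt)     # start copying the next block ...
            result.extend(multiply(current, b))        # ... while multiplying this one
        result.extend(multiply(pending.result(), b))   # epilogue: drain the pipeline
    return result
```

Calling `pipelined_gemm(a, b, block_rows=1)` on small nested lists returns the same product as an unblocked multiplication; the block size only changes how the work is split between the copy and compute stages.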
The Graphics Processing Unit (GPU) is present in almost every modern day personal computer. Despite...
General purpose computing on graphics processing units (GPGPU) is fast becoming a common feature of ...
Widespread heterogeneous parallelism is unavoidable given the emergence of General-Purpose computing...
This paper presents results of our study on double-precision general matrix-matrix multiplic...
In this paper we will present a detailed study on tuning double-precision matrix-matrix mult...
The use of auto-tuning techniques in a matrix multiplication routine for hybrid CPU+GPU platforms is...
As users and developers, we are witnessing the opening of a new computing scenario: the introduction...
The development of high performance dense linear algebra (DLA) critically depends on highly optimize...
Sparse general matrix multiplication (SpGEMM) is an important and expensive computation primitive in...
We present a new approach to utilizing all CPU cores and all GPUs on heterogeneous multicore and mul...
In this paper we focus on optimizing compute and memory-bandwidth-intensive GMM computations for low...
For the past decade, power/energy consumption has become a limiting factor for large-scale and embed...
This paper presents the design and implementation of a highly efficient Double-precision General Matr...
For the past decade, power/energy consumption has become a limiting factor for large-scale and embed...
Modern supercomputers rely on accelerators to speed up highly parallel workloads. Intricate programm...