General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show poor performance for tall-and-skinny matrices, which are much taller than they are wide. In this case, NVIDIA’s current CUBLAS implementation delivers only a fraction of the performance potential indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different parallelization and thread-distribution strategies and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations, we use code gener...
We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for mult...
In this article, we discuss the performance modeling and optimization of Sparse Matrix-Vector Multip...
General purpose computing on graphics processing units (GPGPU) is fast becoming a common feature of ...
General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized fo...
In this paper we discuss about our experiences in improving the performance of two key algorithms: t...
AbstractThis paper presents results of our study on double-precision general matrix-matrix multiplic...
A current trend in high-performance computing is to decompose a large linear algebra problem into ba...
Abstract. Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major buildin...
We provide efficient single- and double-precision GPU (Graphics Processing Unit) implementa-tions of...
We present several algorithms to compute the solution of a linear system of equations on a graphics ...
Modern graphics processing units (GPUs) have been at the leading edge of in-creasing chip-level para...
This paper presents initial experiments in implementing two notable matrix multiplication algorithms...
We present two designs (I and II) for IEEE 754 double precision floating point matrix multiplication...
Today’s hardware platforms have parallel processing capabilities and many parallel programming model...
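The roofline argument in the first abstract above can be made concrete with a small sketch. For C (M×N) = A (M×K)·B (K×N) in double precision, the attainable rate is the minimum of the machine's peak and the product of arithmetic intensity and memory bandwidth; for tall-and-skinny shapes (M huge, N and K tiny) the intensity collapses to roughly N/4 flops per byte, so the kernel is memory-bound. The hardware numbers below (7 TF/s peak, 900 GB/s bandwidth) are illustrative assumptions, not figures from any of the papers listed.

```python
# Hedged sketch: roofline bound for double-precision GEMM, C += A * B.
# Assumes minimum data traffic: A and B read once, C read and written once.
# peak_gflops and bw_gbs are assumed hardware parameters for illustration.

def roofline_dgemm_gflops(M, N, K, peak_gflops, bw_gbs):
    flops = 2.0 * M * N * K                       # one multiply-add per entry triple
    bytes_moved = 8.0 * (M * K + K * N + 2 * M * N)  # 8 bytes per double
    intensity = flops / bytes_moved               # flops per byte
    return min(peak_gflops, intensity * bw_gbs)   # roofline: compute vs. bandwidth cap

# Square GEMM saturates the compute peak; tall-and-skinny is bandwidth-limited.
square = roofline_dgemm_gflops(4096, 4096, 4096, peak_gflops=7000, bw_gbs=900)
skinny = roofline_dgemm_gflops(10_000_000, 4, 4, peak_gflops=7000, bw_gbs=900)
```

With these assumed numbers, the square case hits the 7000 GF/s peak while the tall-and-skinny case is capped near 300 GF/s, which is the gap the first abstract attributes to CUBLAS on such shapes.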