Abstract: We describe the implementation and performance of dense matrix multiplication and LU decomposition on the GRAPE-DR SIMD accelerator board. A GRAPE-DR card, with four GRAPE-DR chips, has a theoretical peak double-precision performance of 819 Gflops. Each GRAPE-DR chip has 512 processing elements (PEs) and operates at a 400 MHz clock frequency; each PE can perform one addition and one multiplication every two clock cycles. The measured performance of matrix multiplication is 730 Gflops for the multiplication of a 51200-by-2048 matrix by a 2048-by-51200 matrix. The performance of LU decomposition is 480 Gflops for a problem size of 51200.
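The peak and efficiency figures quoted above can be checked with simple arithmetic: one addition plus one multiplication every two cycles is one flop per cycle per PE. A minimal sketch (all numbers taken from the abstract; the efficiency percentages are derived here, not stated in the text):

```python
# Back-of-the-envelope check of the GRAPE-DR performance figures.

chips_per_card = 4
pes_per_chip = 512
clock_hz = 400e6              # 400 MHz
flops_per_pe_per_cycle = 1    # one add + one mul every two cycles = 1 flop/cycle

peak_flops = chips_per_card * pes_per_chip * clock_hz * flops_per_pe_per_cycle
print(f"peak: {peak_flops / 1e9:.1f} Gflops")  # 819.2 Gflops, matching the text

# Measured figures from the abstract:
dgemm_measured = 730e9   # matrix multiplication
lu_measured = 480e9      # LU decomposition

print(f"DGEMM efficiency: {dgemm_measured / peak_flops:.0%}")
print(f"LU efficiency:    {lu_measured / peak_flops:.0%}")
```

This puts the matrix-multiplication run at roughly 89% of theoretical peak and the LU decomposition at roughly 59%, which is the usual pattern: LU spends part of its time in panel factorization and data movement that the accelerator cannot overlap with GEMM-like work.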
We provide efficient single- and double-precision GPU (Graphics Processing Unit) implementations of...
General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEM...
Sparse matrix–vector multiplication (SpMV) is of singular importance in sparse linear algebra, which...
Abstract: We describe the design and performance of the GRAPE-MP board, a SIMD accelerator board for ...
Abstract: Few realize that, for large matrices, many dense matrix computations achieve nearly the sa...
In this project I optimized the Dense Matrix-Matrix multiplication calculation by tiling the matrice...
Abstract: This paper presents results of our study on double-precision general matrix-matrix multiplic...
As users and developers, we are witnessing the opening of a new computing scenario: the introduction...
Abstract. Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major buildin...
Proceedings of: Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016...
We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. O...
This repository contains the code and scripts for verifying the claims in the paper "Design Principl...
General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized fo...
The performance of a parallel matrix-matrix-multiplication routine with the same functionality as DG...