Abstract. We describe the implementation and performance of dense matrix multiplication and LU decomposition on the GRAPE-DR SIMD accelerator board. A GRAPE-DR card, which carries four GRAPE-DR chips, has a theoretical peak double-precision performance of 819 Gflops. Each GRAPE-DR chip has 512 processing elements (PEs) and operates at a 400 MHz clock frequency; each PE can perform one addition and one multiplication every two clock cycles. The measured performance of matrix multiplication is 730 Gflops for the multiplication of matrices of size 51200 by 2048 and 2048 by 51200. The performance of LU decomposition is 480 Gflops for a problem size of 51200.
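As a quick sanity check on the quoted peak figure (our own arithmetic from the parameters stated in the abstract, not a number taken from the paper body), the per-card double-precision peak follows as

\[
P_{\mathrm{peak}}
  = 4\ \text{chips} \times 512\ \text{PEs} \times 400\,\mathrm{MHz} \times
    \frac{2\ \text{flops}}{2\ \text{cycles}}
  = 819.2\ \mathrm{Gflops},
\]

so the measured 730 Gflops for matrix multiplication corresponds to roughly 89% of peak, and the 480 Gflops for LU decomposition to roughly 59%.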