This paper presents the design and implementation of a highly efficient Double-precision General Matrix Multiplication (DGEMM) based on OpenBLAS for 64-bit ARMv8 eight-core processors. We adopt a theory-guided approach by first developing a performance model for this architecture and then using it to guide our exploration. The key enabler for a highly efficient DGEMM is a highly optimized inner kernel GEBP developed in assembly language. We have obtained GEBP by (1) maximizing its compute-to-memory access ratios across all levels of the memory hierarchy in the ARMv8 architecture with its performance-critical block sizes being determined analytically, and (2) optimizing its computations through exploiting loop unrolling, instruction scheduli...
This paper proposes a micro-kernel to efficiently compute 4x4 8-bit matrix mul...
As users and developers, we are witnessing the opening of a new computing scenario: the introduction...
In heterogeneous systems that include CPUs and GPUs, the data transfers between these components pla...
In this paper we will present a detailed study on tuning double-precision matrix-matrix mult...
Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the pote...
This paper presents results of our study on double-precision general matrix-matrix multiplic...
A current trend in high-performance computing is to decompose a large linear algebra problem into ba...
BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the “GotoB...
One of the key areas for enabling users to efficiently use an HPC system is providing optimized BLAS...
This paper describes a novel parallel algorithm that implements a dense matrix multiplication operat...
Sparse general matrix multiplication (SpGEMM) is an important and expensive computation primitive in...
This paper examines how to write code to gain high performance on modern computers as well as the im...
The performance of a parallel matrix-matrix-multiplication routine with the same functionality as DG...
Sparse general matrix multiplication (SpGEMM) is a fundamental building block for many real-world ap...
Traditional parallel programming methodologies for improving performance assume cache-bas...