Over the last half-decade, a number of research efforts have centered on developing software for generating automatically tuned matrix multiplication kernels. These include the PHiPAC project and the ATLAS project. The software products of both projects employ a brute-force search of a parameter space for blockings that accommodate multiple levels of the memory hierarchy. We take a different approach: using a simple model of hierarchical memories, we employ mathematics to determine a locally optimal strategy for blocking matrices. The theoretical results show that, depending on the shape of the matrices involved, different strategies are locally optimal. Rather than determining a blocking strategy at library generation time, the theoretical ...
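The blocking strategy this abstract refers to can be illustrated with a minimal sketch: partition the matrices into blocks small enough that one block from each operand fits in a given cache level, and accumulate block products. The block size `bs` below is illustrative, not a tuned value, and the function name is our own.

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Compute A @ B one bs-by-bs block at a time.

    A suitable bs keeps the three active blocks (of A, B, and C)
    resident in a single cache level, so each element is reused
    O(bs) times per load instead of being re-fetched from memory.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, bs):
        for j in range(0, n, bs):
            for p in range(0, k, bs):
                # NumPy slicing handles ragged edge blocks automatically.
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C
```

In a real library the loop order and block sizes would be chosen per architecture, which is exactly the parameter space PHiPAC and ATLAS search by brute force.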
Matrix-matrix multiplication is perhaps the most important operation used as a basic building block...
This paper discusses optimizing computational linear algebra algorithms on a ring cluster of IBM R...
Abstract. Traditional parallel programming methodologies for improving performance assume cache-bas...
The optimal implementation of matrix multiplication on modern computer architectures is of great imp...
This report builds on the work done in the deliverable [Nava94]. There it was shown tha...
Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been impleme...
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expe...
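The trade-off this abstract mentions can be seen in a short sketch of Strassen's recursion: seven recursive multiplications replace the classical eight, at the cost of extra additions and temporary block storage. The `cutoff` crossover point is an illustrative assumption; real implementations tune it per machine.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's algorithm for square matrices.

    Each level does 7 sub-multiplications instead of 8, giving
    O(n^log2(7)) arithmetic, but allocates temporaries for the
    18 block additions/subtractions below.
    """
    n = A.shape[0]
    if n <= cutoff or n % 2:
        return A @ B  # fall back to the classical product
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The extra additions and temporaries are the "expense" the abstract refers to: below the crossover size they outweigh the saved multiplication.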
Many fast algorithms in arithmetic complexity have hierarchical or recursive structures that make ef...
In the last decade floating-point matrix multiplication on FPGAs has been studied extensively and ef...
Abstract. This paper presents a study of performance optimization of dense matrix multiplication on ...
Abstract. This paper presents uniprocessor performance optimizations, automatic tuning techniques, a...
In this paper we demonstrate the practical portability of a simple version of matrix multiplication ...
Hierarchical matrix (H-matrix) techniques can be used to efficiently treat dense matrices. With an H...
This Master's thesis examines whether a matrix multiplication program that combines the two efficiency stra...
Modern microprocessors can achieve high performance on linear algebra kernels but this currently req...