Modern microprocessors can achieve high performance on linear algebra kernels, but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced "fee-pack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM-compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the...
Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been impleme...
Abstract. This paper presents a study of performance optimization of dense matrix multiplication on ...
In the last decade floating-point matrix multiplication on FPGAs has been studied extensively and ef...
During the last half-decade, a number of research efforts have centered around developing software f...
Achieving peak performance from the computational kernels that dominate application performance oft...
This thesis describes novel techniques and test implementations for optimizing numerically intensive...
This report has been developed over the work done in the deliverable [Nava94]. There it was shown tha...
Abstract. Traditional parallel programming methodologies for improving performance assume cache-bas...
Abstract—Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) t...
The performance of a parallel matrix-matrix-multiplication routine with the same functionality as DG...
Technology scaling trends have enabled the exponential growth of computing power. However, the perfo...
In this project I optimized the Dense Matrix-Matrix multiplication calculation by tiling the matrice...
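The tiling strategy mentioned in this project abstract is a standard cache-blocking technique: the matrices are processed in small blocks so the working set of the inner loops fits in cache. A minimal sketch is below; the function name, the block size, and the pure-Python representation are illustrative assumptions, not details from the project itself.

```python
def matmul_tiled(A, B, block=32):
    """Cache-blocked (tiled) multiply of two n-by-n matrices given as
    lists of lists. Blocking improves cache reuse; the numerical result
    is identical to the naive triple loop."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):          # block rows of A / C
        for kk in range(0, n, block):      # block columns of A / rows of B
            for jj in range(0, n, block):  # block columns of B / C
                # Multiply one block triple; min() handles edge blocks.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```

In an optimized C or Fortran implementation the block size would be tuned to the cache hierarchy (which is exactly what autotuning systems such as PHiPAC search for); here it is just a fixed parameter.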
In order to utilize the tremendous computing power of graphics hardware and to automatically adapt t...
In this document, we describe two strategies of distribution of computations that can be used to imp...