Modern microprocessors can achieve high performance on linear algebra kernels, but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we have developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced "fee-pack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM-compatible, multi-level cache-blocked matrix multiply generator which produces code that achieves around 90 % o...
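The two ideas named in the abstract, a kernel parameterized by blocking factors and a search that picks the best parameter for the machine at hand, can be illustrated with a minimal sketch. This is not the PHiPAC generator or its output: the function names `blocked_gemm` and `search_block_size`, the single square block size `nb`, the row-major layout, and the use of `clock()` for timing are all assumptions made for illustration, and the code is written in C99 rather than strict ANSI C for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: C += A * B for row-major n x n matrices,
 * tiled with square blocks of size nb (the tunable parameter). */
void blocked_gemm(size_t n, size_t nb,
                  const double *a, const double *b, double *c)
{
    assert(nb > 0);
    for (size_t ii = 0; ii < n; ii += nb) {
        size_t i_end = (ii + nb < n) ? ii + nb : n;
        for (size_t kk = 0; kk < n; kk += nb) {
            size_t k_end = (kk + nb < n) ? kk + nb : n;
            for (size_t jj = 0; jj < n; jj += nb) {
                size_t j_end = (jj + nb < n) ? jj + nb : n;
                /* Multiply one block of A by one block of B. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        /* Hold the A element in a local scalar so the
                         * compiler can keep it in a register. */
                        const double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}

/* Hypothetical search step: time each candidate block size once and
 * return the fastest. C keeps accumulating across trials, which is
 * harmless here because only the timing is of interest. */
size_t search_block_size(size_t n, const size_t *cands, size_t ncands,
                         const double *a, const double *b, double *c)
{
    assert(ncands > 0);
    size_t best = cands[0];
    double best_time = -1.0;
    for (size_t t = 0; t < ncands; t++) {
        clock_t start = clock();
        blocked_gemm(n, cands[t], a, b, c);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = cands[t];
        }
    }
    return best;
}
```

A caller would allocate three n x n arrays, pass a handful of candidate sizes such as 16, 32, and 64 to `search_block_size`, and then use the returned value for subsequent calls to `blocked_gemm`. A realistic search would repeat each timing several times and would tune separate blocking factors per loop and per cache level, as the multi-level blocking mentioned in the abstract suggests.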
A current trend in high-performance computing is to decompose a large linear algebra problem into ba...
The performance of a parallel matrix-matrix-multiplication routine with the same functionality as DG...
For the past decade, power/energy consumption has become a limiting factor for large-scale and embed...
This thesis describes novel techniques and test implementations for optimizing numerically intensive...
During the last half-decade, a number of research efforts have centered around developing software f...
The article is devoted to the vectorization of calculations for Intel Xeon Phi Knights Landing (KNL)...
Abstract—Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) t...
Abstract. Traditional parallel programming methodologies for improving performance assume cache-bas...
As users and developers, we are witnessing the opening of a new computing scenario: the introduction...
This paper examines how to write code to gain high performance on modern computers as well as the im...
Abstract. This paper presents a study of performance optimization of dense matrix multiplication on ...
This report has been developed over the work done in the deliverable [Nava94]. There it was shown tha...