Modern microprocessors can achieve high performance on linear algebra kernels, but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we have developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced "fee-pack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM-compatible, multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% ...
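The generator-plus-search idea described above can be illustrated with a minimal sketch. It is not the PHiPAC implementation: PHiPAC emits ANSI C, whereas this sketch uses Python for brevity, tunes a single parameter (one cache-block size), and uses hypothetical names (`make_blocked_matmul`, `search_block_size`) chosen only for illustration.

```python
import time


def make_blocked_matmul(block):
    """Generate source for a blocked n-by-n matrix multiply specialized
    for one block size, compile it, and return the resulting function.
    (Hypothetical helper; a stand-in for a parameterized code generator.)"""
    src = f'''
def matmul(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, {block}):
        for kk in range(0, n, {block}):
            for jj in range(0, n, {block}):
                for i in range(ii, min(ii + {block}, n)):
                    Ai, Ci = A[i], C[i]
                    for k in range(kk, min(kk + {block}, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + {block}, n)):
                            Ci[j] += a * Bk[j]
    return C
'''
    namespace = {}
    exec(src, namespace)  # compile the generated source
    return namespace["matmul"]


def search_block_size(candidates, n=32, repeats=3):
    """Time each generated kernel on this machine and return the fastest
    block size together with all measured timings.
    (Hypothetical helper; a stand-in for a search script.)"""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        kernel = make_blocked_matmul(block)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(A, B, n)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the best of several runs
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_block, timings = search_block_size([4, 8, 16, 32])
    print("fastest block size on this machine:", best_block)
```

The winning block size depends on the machine and on the interpreter, which is the point of searching rather than fixing the parameter by hand. A realistic generator would expose many more parameters (register blocking, multiple cache levels, unrolling) and would emit compiled code.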
A current trend in high-performance computing is to decompose a large linear algebra problem into ba...
Abstract. This paper presents a study of performance optimization of dense matrix multiplication on ...
Matrix computations lie at the heart of many scientific computational algorithms including signal pr...
This thesis describes novel techniques and test implementations for optimizing numerically intensive...
During the last half-decade, a number of research efforts have centered around developing software f...
The article is devoted to the vectorization of calculations for Intel Xeon Phi Knights Landing (KNL)...
As users and developers, we are witnessing the opening of a new computing scenario: the introduction...
Abstract—Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) t...
Abstract. Traditional parallel programming methodologies for improving performance assume cache-bas...
For the past decade, power/energy consumption has become a limiting factor for large-scale and embed...
This report has been developed over the work done in the deliverable [Nava94]. There it was shown tha...
This paper examines how to write code to gain high performance on modern computers as well as the im...
Abstract. Autotuning technology has emerged recently as a systematic process for evaluating alternat...