We present accurate piece-wise models for the time and energy costs of high-performance implementations of both the matrix multiplication (gemm) and the triangular system solve with multiple right-hand sides (trsm) on x86 architectures. Our methodology decouples the costs due to the floating-point arithmetic/data movement occurring in the higher levels of the cache hierarchy from those of packing/data transfers between the main memory and the L2/L3 cache. A careful analytical study of the data transfers, combined with an architecture-specific calibration of the costs per operation, then yields the components needed to assemble piece-wise models that accurately estimate the performance of gemm and trsm on x86 processors. Our experimental ...
For the past decade, power/energy consumption has become a limiting factor for large-scale and embed...
In this paper we conduct a detailed analysis of the sources of power dissipation and energy consumpt...
In this thesis, the performance and energy efficiency of four different implementations of matrix mu...
This is the author’s version of a work that was accepted for publication in Simulation Modelling Pra...
We present accurate time and energy piece-wise models of high-performance multi-threaded implementat...
The power wall asks for a holistic effort from the high performance and scientific communities to de...
The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply ...
The overarching goal of this thesis is to provide an algorithm-centric approach to analyzing the rel...
A current trend in high-performance computing is to decompose a large linear algebra problem into ba...
Commodity clusters augmented with application accelerators are evolving as competitive high performa...
In this paper, we propose a model for the energy consumption of the concurrent execution of three ke...
For the past decade, power/energy consumption has become a limiting factor for large-scale and embed...
Full-system simulation frameworks such as gem5 are used extensively to evaluate research ideas and f...
The performance of a parallel matrix-matrix-multiplication routine with the same functionality as DG...
Floating-point matrix multiplication is a basic kernel in scientific computing. It has been shown th...