In this paper we propose a set of optimizations for the BLAS-3 routines of the LASs library (Linear Algebra routines on OmpSs) and perform a detailed analysis of the impact of the proposed changes in terms of performance and execution time. OmpSs allows regions to be used in task dependences. This helps not only in programming the algorithmic optimizations, but also in the reduction in execution time that such optimizations achieve. Different strategies are implemented to reduce the number of tasks created (when there is enough parallelism) during the execution of BLAS-3 operations in the original LASs. A better IPC is also obtained thanks to better exploitation of the memory hierarchy. More specifically, we increase the p...
The need for features for managing complex data accesses in modern programming models has increased ...
The final publication is available at Springer via http://dx.doi.org/10.1007/s10766-013-0249-6 The in...
Abstract. Mr. Goto wrote code that greatly improved GEMM and was once the fastest program in the world. In...
© 2019 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://...
The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply ...
A current trend in high-performance computing is to decompose a large linear algebra problem into ba...
Exascale performance will require massive parallelism and asynchronous execution (DARPA, DOE, EESI2)...
The functions library, called Basic Linear Algebra Subprograms (BLAS-1), is considered the programmi...
In the last ten years, GPUs have dominated the market considering the computin...
This paper describes an implementation of Level 3 of the Basic Linear Algebra Subprogram (BLAS-3) li...
One of the key areas for enabling users to efficiently use an HPC system is providing optimized BLAS...
In this paper, we investigate how to exploit task-parallelism during the execution of the Cholesky f...
Abstract. The Basic Linear Algebra Subprograms, BLAS, are the basic computational kernels in most ap...
We provide timing results for common linear algebra subroutines across BLAS (Basic Linear Algebra S...