© 2019 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

In this work we have implemented a novel linear algebra library on top of the task-based runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features: weak dependencies and regions, together with the final clause, to implement auto-tunable code for the BLAS-3 trsm routine and the LAPACK routines npgetrf and npgesv. All these implementations are part of the first prototype of the sLASs library, a novel library of auto-tunable codes for linear algebra operations based on the LASs library. In all these cases, the use of the OmpSs-2 features yields an improvement in terms of execution time.
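To make the task structure concrete, the sketch below shows how a blocked lower-triangular solve (the core pattern of trsm) might be annotated with these OmpSs-2 features. It is a minimal illustration, not the actual sLASs source: the block size BS, the final() cutoff, and the kernels trsm_kernel and gemm_kernel are assumptions made for this example. The outer task carries only weak dependencies over whole array regions, so it is cheap to create and lets the runtime link the fine-grained in/inout dependencies of the inner tasks across routine boundaries.

    /* A minimal sketch, not the actual sLASs source: a blocked
       lower-triangular solve L*X = B annotated with OmpSs-2 weak
       dependencies, array regions and the final clause. BS, the
       final() threshold and the kernels trsm_kernel/gemm_kernel are
       assumptions of this example; n is assumed a multiple of BS. */
    #include <stddef.h>

    #define BS 256   /* block size; sLASs would tune this value */

    void trsm_kernel(size_t n, double L[n][n], double B[n][n], size_t k);
    void gemm_kernel(size_t n, double L[n][n], double B[n][n],
                     size_t i, size_t k);

    void trsm_tasks(size_t n, double L[n][n], double B[n][n])
    {
        /* Outer task: weak dependencies only. It touches no data
           itself, it merely instantiates the inner tasks. */
        #pragma oss task weakin(L[0;n][0;n]) weakinout(B[0;n][0;n]) \
                         final(n <= 4 * BS)
        {
            for (size_t k = 0; k < n; k += BS) {
                /* Solve the diagonal block of L against block row k of B. */
                #pragma oss task in(L[k;BS][k;BS]) inout(B[k;BS][0;n])
                trsm_kernel(n, L, B, k);

                /* Update the trailing block rows of B. */
                for (size_t i = k + BS; i < n; i += BS) {
                    #pragma oss task in(L[i;BS][k;BS], B[k;BS][0;n]) \
                                     inout(B[i;BS][0;n])
                    gemm_kernel(n, L, B, i, k);
                }
            }
        }
        #pragma oss taskwait
    }

When the outer task is final (here, for problems up to 4*BS in size, an illustrative threshold), the nested tasks execute inline instead of being spawned, so small problems pay no task-creation overhead; picking this cutoff and BS per routine and problem size is exactly the kind of decision an auto-tuned library like sLASs automates.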