The promise of future many-core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP-like pragmas and a runtime system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS-level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that column-m...