We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, such as those incorporated in LAPACK, cannot be easily modified to deliver both high performance and scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data, and operations on these blocks become the fundamental units of computation, resulting in algorithms-by-blocks as opposed to the more traditional blocked algorithms. We show that this facilitates the adoption of techniques akin to dynamic scheduling ...
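A minimal sketch may help make the by-blocks idea concrete. The code below is illustrative, not the paper's implementation: it stores the matrix by blocks and expresses a right-looking Cholesky factorization so that every operation reads and writes whole blocks. The tile kernels simply wrap standard BLAS/LAPACK routines, and a runtime such as SuperMatrix would enqueue each call as a task instead of executing it in program order; the block count NT and block size BS are assumptions for illustration.

```c
/* Illustrative sketch of storage-by-blocks plus a Cholesky algorithm-by-blocks.
 * Tile kernels are delegated to CBLAS/LAPACKE; build with, e.g.:
 *   cc chol_by_blocks.c -llapacke -lopenblas                                  */
#include <cblas.h>
#include <lapacke.h>

#define NT 4      /* blocks per row/column of the matrix (assumed) */
#define BS 256    /* dimension of each square block (assumed)      */

/* Storage-by-blocks: each block is a contiguous BS x BS column-major buffer. */
typedef struct { double *tile[NT][NT]; } blocked_matrix;

/* Right-looking Cholesky-by-blocks; computes the lower factor in place. */
void chol_by_blocks(blocked_matrix *A)
{
    for (int k = 0; k < NT; k++) {
        /* Factor the diagonal block. */
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', BS, A->tile[k][k], BS);

        /* Triangular solves against the blocks below the diagonal. */
        for (int i = k + 1; i < NT; i++)
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, BS, BS, 1.0,
                        A->tile[k][k], BS, A->tile[i][k], BS);

        /* Update the trailing submatrix, one whole block at a time. */
        for (int i = k + 1; i < NT; i++) {
            for (int j = k + 1; j < i; j++)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            BS, BS, BS, -1.0, A->tile[i][k], BS,
                            A->tile[j][k], BS, 1.0, A->tile[i][j], BS);
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        BS, BS, -1.0, A->tile[i][k], BS,
                        1.0, A->tile[i][i], BS);
        }
    }
}
```

Because every kernel call names exactly the blocks it reads and writes, a dependence-tracking runtime can recover the task graph directly from this loop nest.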
The emergence of new manycore architectures, such as the Intel Xeon Phi, poses new challenges in how...
An efficient data structure is presented which supports general unstructured sparse matrix-vector mu...
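The abstract is truncated, so the specific structure proposed there is not reproduced; for reference, the sketch below uses the common compressed sparse row (CSR) layout for general unstructured sparse matrix-vector multiplication, computing y = A*x while touching only the stored nonzeros.

```c
/* CSR sparse matrix-vector multiplication (reference sketch). */
#include <stdio.h>

typedef struct {
    int           nrows;
    const int    *row_ptr;   /* nrows+1 offsets into col_idx/val   */
    const int    *col_idx;   /* column index of each stored entry  */
    const double *val;       /* value of each stored entry         */
} csr_matrix;

/* y := A * x, visiting only the stored nonzeros of each row. */
void csr_spmv(const csr_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example:  [2 0 1]
                     [0 3 0]
                     [4 0 5]                                        */
    const int    row_ptr[] = {0, 2, 3, 5};
    const int    col_idx[] = {0, 2, 1, 0, 2};
    const double val[]     = {2, 1, 3, 4, 5};
    const csr_matrix A = {3, row_ptr, col_idx, val};

    const double x[] = {1, 1, 1};
    double y[3];
    csr_spmv(&A, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);
    return 0;
}
```

Compiled and run, the small example prints 3 3 9.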
We pursue the scalable parallel implementation of the factorization of band matrices with medium ...
Out-of-core implementations of algorithms for dense matrix computations have traditionally focused o...
The multiplication of large sparse matrices is a basic operation for many scientific and engineering ...
The arrival of multicore architectures has generated an interest in reformulating dense matrix compu...
With the emergence of thread-level parallelism as the primary means for continued improvement of per...
Task-based programming models have succeeded in gaining the interest of the hi...
We consider the realization of matrix-matrix multiplication and propose a hierarchical alg...
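Since this abstract is cut off, the particular hierarchy proposed there is not shown; the sketch below illustrates one generic hierarchical formulation, a divide-and-conquer GEMM that recursively halves the largest of the three problem dimensions until the subproblem reaches a cache-sized base case. The cutoff value and the splitting rule are illustrative assumptions.

```c
/* Divide-and-conquer matrix multiplication sketch: C += A*B, column-major. */
#define CUTOFF 64   /* base-case dimension (assumed) */

/* Base case: straightforward triple loop on a small subproblem. */
static void gemm_base(int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < k; p++)
            for (int i = 0; i < m; i++)
                C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
}

/* Recursive case: halve the largest of m, n, k until all fit the base case. */
void gemm_rec(int m, int n, int k,
              const double *A, int lda,
              const double *B, int ldb,
              double *C, int ldc)
{
    if (m <= CUTOFF && n <= CUTOFF && k <= CUTOFF) {
        gemm_base(m, n, k, A, lda, B, ldb, C, ldc);
    } else if (m >= n && m >= k) {          /* split the rows of A and C      */
        int m1 = m / 2;
        gemm_rec(m1,     n, k, A,      lda, B, ldb, C,      ldc);
        gemm_rec(m - m1, n, k, A + m1, lda, B, ldb, C + m1, ldc);
    } else if (n >= k) {                    /* split the columns of B and C   */
        int n1 = n / 2;
        gemm_rec(m, n1,     k, A, lda, B,          ldb, C,          ldc);
        gemm_rec(m, n - n1, k, A, lda, B + n1*ldb, ldb, C + n1*ldc, ldc);
    } else {                                /* split the inner dimension      */
        int k1 = k / 2;
        gemm_rec(m, n, k1,     A,          lda, B,      ldb, C, ldc);
        gemm_rec(m, n, k - k1, A + k1*lda, lda, B + k1,  ldb, C, ldc);
    }
}
```

Each level of the recursion reuses progressively smaller working sets, which is what lets a hierarchical algorithm map naturally onto a cache hierarchy.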
In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtim...
We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using Ope...
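The abstract mentions OpenMP but is truncated, so the paper's exact strategy is not reproduced here. The sketch below shows one common way to task-parallelize a dense factorization with OpenMP, using a tiled Cholesky as a representative DMF: each block operation becomes a task, its operands are declared through depend clauses, and the runtime then schedules the tasks dynamically as their dependences are satisfied. The tile kernels wrap LAPACK/BLAS, the pointer to each tile serves only as the dependence token (a widely used idiom), and the tile counts and sizes are assumptions.

```c
/* Task-parallel tiled Cholesky with OpenMP dependences (illustrative sketch).
 * Link with, e.g.:  cc -fopenmp chol_tasks.c -llapacke -lopenblas             */
#include <cblas.h>
#include <lapacke.h>

#define NT 8      /* tiles per row/column (assumed) */
#define BS 128    /* tile dimension (assumed)       */

/* tile[i][j] points to a contiguous BS x BS column-major block; the pointer
 * element itself is used as the dependence token for that block's data.      */
void chol_tasks(double *tile[NT][NT])
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: tile[k][k])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', BS, tile[k][k], BS);

        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: tile[k][k]) depend(inout: tile[i][k])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, BS, BS, 1.0,
                        tile[k][k], BS, tile[i][k], BS);
        }

        for (int i = k + 1; i < NT; i++) {
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: tile[i][k], tile[j][k]) \
                                 depend(inout: tile[i][j])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            BS, BS, BS, -1.0, tile[i][k], BS,
                            tile[j][k], BS, 1.0, tile[i][j], BS);
            }
            #pragma omp task depend(in: tile[i][k]) depend(inout: tile[i][i])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        BS, BS, -1.0, tile[i][k], BS,
                        1.0, tile[i][i], BS);
        }
    }   /* the barrier closing the parallel region waits for all tasks */
}
```

Note that the loop nest is the same algorithm-by-blocks sketched earlier; only the pragmas change, which is precisely what makes OpenMP tasking attractive for this class of codes.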
Few realize that, for large matrices, many dense matrix computations achieve nearly the sa...
This paper presents a dynamic task scheduling approach to executing dense linear algebra algorithms ...
In this paper, we investigate how to exploit task-parallelism during the execution of the Cholesky ...