Matrix factorization (often called decomposition) is a frequently used kernel in a large number of applications, ranging from linear solvers to data clustering and machine learning. The central contribution of this paper is a thorough performance study of four popular matrix factorization techniques, namely LU, Cholesky, QR, and SVD, on the STI Cell Broadband Engine. The paper explores algorithmic as well as implementation challenges related to the Cell chip-multiprocessor and explains how we achieve near-linear speedup for most of the factorization techniques over a range of matrix sizes. For each of the factorization routines, we identify the bottleneck kernels and explain how we have attempted to resolve the bottleneck and to what extent ...
This paper represents the first attempt towards a decomposition-independent implementation of parall...
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distribu...
The Chip Multiprocessor (CMP) will be the basic building block for computer systems rangi...
The QR factorization is one of the most important operations in dense linear algebra, offering a num...
In this work, we examine the potential of using the recently-released STI Cell processor as a buildi...
The objective of this paper is to extend, in the context of multicore architectures, the concepts of...
The bottleneck of most data analyzing systems, signal processing systems, and intensive computing sy...
The slowing pace of commodity microprocessor performance improvements combined with ever-increasing ...
The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in p...
This article discusses the core factorization routines included in the ScaLAPACK library. These rout...
The slowing pace of commodity microprocessor performance improvements combined with ever-increasing ...
This paper describes the design and implementation of three core factorization routines--LU, QR and ...
We investigate performance characteristics for the LU factorization of large matrices with various s...
We pursue the scalable parallel implementation of the factorization of band matrices with medium ...