The performance of a parallel matrix-matrix multiplication routine with the same functionality as DGEMM from BLAS3 was tested for different numbers of nodes on a 32-node iPSC/860. The routine was then tuned for maximum performance on this particular computer system. Small changes to the original code led to substantially higher performance, and in all tested configurations there is a critical matrix size n ≈ 50·np, where np is the number of processors, above which Intel's non-blocking isend is more efficient than the blocking csend. This shows that special tuning for a single machine pays off for large matrices.
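The reported crossover point scales linearly with the processor count. A minimal sketch of the stated rule n ≈ 50·np (an illustration of the formula as given, not a measurement, and the helper name is ours):

```python
def crossover_size(num_procs: int) -> int:
    """Approximate matrix size n above which the non-blocking isend
    was reported to beat the blocking csend: n ≈ 50 * np."""
    return 50 * num_procs

# On the full 32-node machine the crossover lies near n ≈ 1600;
# on 8 nodes it drops to n ≈ 400.
for p in (8, 16, 32):
    print(p, crossover_size(p))
```

So the larger the partition of the machine, the larger the matrices must be before switching to non-blocking sends pays off.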