Sparse matrix-vector (SpMV) multiplication is a widely used kernel in scientific applications. In these applications, the SpMV multiplication is usually deeply nested within multiple loops and thus executed a large number of times. We have observed significant performance variability due to irregular memory access patterns. Static performance optimizations are difficult because the patterns may be known only at runtime. In this paper, we propose adaptive runtime tuning mechanisms to improve the parallel performance on distributed memory systems. Our adaptive iteration-to-process mapping mechanism balances computational load at runtime with negligible overhead (1% on average), and our runtime communication selection algori...
Sparse times dense matrix multiplication (SpMM) finds its applications in well-established fields su...
Parallelizing sparse irregular applications on distributed memory systems poses serious scalability c...
Using runtime information of load distributions and processor affinity, we propose an adapt...
For a parallel Sparse Matrix Vector Multiply (SpMV) on a multiprocessor, rather simple and efficient...
Parallel sparse matrix-matrix multiplication algorithms (PSpGEMM) spend most of their running time o...
We consider optimizations that are required for efficient execution of code segments that consist o...
This paper describes a number of optimizations that can be used to support the efficient execution o...
Sparse matrix vector multiplication (SpMV) is the dominant kernel in scientific simulations....
This whitepaper addresses applicability of the MapReduce paradigm for scientific computing by realiz...
Runtime specialization optimizes programs based on partial information available only at run time. I...
This paper presents uniprocessor performance optimizations, automatic tuning techniques, a...
Sparse Matrix-vector Multiplication (SMvM) is a mathematical technique encountered in many programs ...
The tuning of parallel programs on large distributed-memory machines today is usually a costly, and ...
In this paper we present two algorithms for performing sparse matrix-dense vector multipli...
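For orientation, the SpMV kernel that most of these abstracts revolve around can be sketched in a few lines. The block below is a minimal illustration only, assuming the common compressed sparse row (CSR) layout; it is not taken from any of the listed papers. The function names spmv_csr and partition_rows_by_nnz are hypothetical, and the nonzero-balanced row split is a simple static stand-in for the kind of load balancing the abstracts describe, not the adaptive runtime mechanisms they propose.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def spmv_csr(indptr: Sequence[int], indices: Sequence[int],
             data: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form.

    indptr[r]:indptr[r + 1] delimits the nonzeros of row r; indices holds
    their column numbers and data holds their values.
    """
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # The access x[indices[k]] is the irregular, data-dependent read.
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def partition_rows_by_nnz(indptr: Sequence[int],
                          num_parts: int) -> List[Tuple[int, int]]:
    """Split rows into contiguous half-open blocks (start, end) so that
    each block holds roughly the same number of nonzeros.

    Hypothetical helper: a static heuristic, assumed here for illustration.
    """
    n_rows = len(indptr) - 1
    total = indptr[-1]
    bounds = [0]
    for p in range(1, num_parts):
        target = total * p / num_parts
        # First row boundary whose nonzero prefix count reaches the target.
        cut = bisect_left(indptr, target)
        # Step back one row if that boundary is at least as close.
        if cut > bounds[-1] and target - indptr[cut - 1] <= indptr[cut] - target:
            cut -= 1
        cut = min(max(cut, bounds[-1]), n_rows)
        bounds.append(cut)
    bounds.append(n_rows)
    return [(bounds[i], bounds[i + 1]) for i in range(num_parts)]
```

Splitting by nonzero count rather than by row count matters because the work per row in SpMV is proportional to the number of nonzeros in that row, so equal-sized row blocks can carry very unequal loads.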