Many loop-program optimizations used in parallelizing compilers and systolic array design are based on linear transformations of loop iteration spaces. Additional important optimizations and designs become possible with modular mappings, which are linear transformations taken modulo a constant vector. This thesis investigates necessary and sufficient conditions for modular mappings to be one-to-one on rectangular domains of arbitrary dimension. It also identifies and characterizes a class of (BLAS-like) algorithms that can be optimized for parallel execution by modular mappings. To reduce communication overheads, this thesis provides conditions on data alignment and partitioning that allow perfect ali...
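As a rough illustration of what a modular mapping is, the sketch below applies a map of the form i -> (T i) mod m to every point of a small rectangular domain and checks by exhaustive enumeration whether it is one-to-one. The function names, the matrices, and the 4 x 4 domain are hypothetical choices made for this example. The brute-force test is a stand-in only and is not the necessary and sufficient conditions developed in the thesis.

```python
from itertools import product


def modular_map(T, m, point):
    """Apply the modular mapping i -> (T i) mod m, component by component.

    T is a list of integer rows, m is the constant modulus vector,
    and point is one iteration-space index tuple.
    """
    return tuple(
        sum(t * x for t, x in zip(row, point)) % mod
        for row, mod in zip(T, m)
    )


def is_one_to_one(T, m, bounds):
    """Exhaustively test injectivity on the rectangular domain
    0 <= i_k < bounds[k]. This is an illustrative check, exponential in
    the number of dimensions, not an analytic condition."""
    seen = set()
    for point in product(*(range(b) for b in bounds)):
        image = modular_map(T, m, point)
        if image in seen:
            return False  # two distinct points share an image
        seen.add(image)
    return True


# Hypothetical skewed mapping: (i, j) -> ((i + j) mod 4, j mod 4)
skewed = [[1, 1], [0, 1]]
# Hypothetical degenerate mapping: both components are (i + j) mod 4
collapsed = [[1, 1], [1, 1]]

if __name__ == "__main__":
    print(is_one_to_one(skewed, (4, 4), (4, 4)))
    print(is_one_to_one(collapsed, (4, 4), (4, 4)))
```

The skewed example is one-to-one on the 4 x 4 domain because, for a fixed j, the values (i + j) mod 4 are distinct for distinct i. The degenerate example is not, since (0, 1) and (1, 0) share an image.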
Implementing linear algebra kernels on distributed memory parallel computers raises the problem of d...
Our experimental results showed that block-based algorithms for numerically intensive applications a...
We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Conn...
This report considers the problem of writing data distribution independent (DDI) programs in order t...
Two issues in linear algebra algorithms for multicomputers are addressed. First, how to unify paralle...
This paper discusses the design of linear algebra libraries for high performance computers. Particul...
Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been impleme...
This paper presents a technique for finding good distributions of arrays and suitable loop restructu...
We present a new fast and scalable matrix multiplication algorithm, called DIMMA (Distribution-Indep...
With the emergence of thread-level parallelism as the primary means for continued improvement of per...
Software overheads can be a significant cause of performance degradation in parallel numerical libra...
This work is a small step in the direction of code portability over parallel and vector machines. Th...
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processor...
This paper presents a data layout optimization technique for sequential and parallel progra...