This paper discusses the importance of memory access optimizations, which are shown to be highly effective on the MasPar architecture. The study is based on two MasPar machines: a 16K-processor MP-1 and a 4K-processor MP-2. A software pipelining technique overlaps memory accesses with computation and/or communication. Another optimization, called the register window technique, reduces the number of loads in a loop. These techniques are evaluated using three parallel matrix multiplication algorithms on both MasPar machines. The matrix multiplication study shows that, for a highly computation-intensive problem, reducing interprocessor communication can become a secondary issue compared to memory access optimization. Also, it is shown t...
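The two optimizations named in the abstract can be illustrated with a small sketch. This is not the paper's MasPar implementation: the abstract does not define the register window technique, so the code assumes it means keeping a sliding window of already-loaded values in local variables so that each element is loaded only once. `CountingMemory` and all function names are hypothetical, and because Python runs sequentially, the "pipelined" loop only shows the loop restructuring (prologue, steady state, epilogue), not real overlap.

```python
class CountingMemory:
    """Hypothetical memory model that counts how many loads are issued."""

    def __init__(self, data):
        self.data = list(data)
        self.loads = 0

    def load(self, i):
        self.loads += 1
        return self.data[i]


def window_sums_naive(mem, n, w):
    """Sum every length-w window, reloading all w operands each iteration."""
    return [sum(mem.load(i + k) for k in range(w)) for i in range(n - w + 1)]


def window_sums_registers(mem, n, w):
    """Same result (assumes n >= w), but reuses values held in a local window."""
    window = [mem.load(k) for k in range(w)]  # prologue: fill the window
    out = [sum(window)]
    for i in range(w, n):
        window.pop(0)               # oldest value leaves the window
        window.append(mem.load(i))  # exactly one new load per iteration
        out.append(sum(window))
    return out


def dot_pipelined(mem_a, mem_b, n):
    """Dot product with the loads for iteration i+1 issued before the
    multiply-add of iteration i (software-pipelined loop structure)."""
    if n == 0:
        return 0
    a, b = mem_a.load(0), mem_b.load(0)  # prologue: first loads
    acc = 0
    for i in range(1, n):
        next_a, next_b = mem_a.load(i), mem_b.load(i)  # loads for next step
        acc += a * b                                   # compute current step
        a, b = next_a, next_b
    acc += a * b  # epilogue: last multiply-add has no load to overlap with
    return acc
```

On hardware that can issue a load and continue executing, the steady-state loop body lets the multiply-add hide part of the load latency; the windowed version cuts the load count from w per iteration to one.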
PhD Thesis. Current microprocessors improve performance by exploiting instruction-level parallelism (I...
The emergence of a new, open, and free instruction set architecture, RISC-V, has heralded a new era ...
Achieving high application performance depends on the combination of memory footprint, instruction m...
This work explores the tradeoffs of the memory system of a new massively parallel multiprocessor in ...
The power, frequency, and memory wall problems have caused a major shift in mainstream computing by ...
Thesis (Ph.D.), School of Electrical Engineering and Computer Science, Washington State University. Pa...
The objective of high performance computing (HPC) is to ensure that the computational power of hardw...
One of the critical problems facing designers of high performance processors is the disparity betwee...
Memory bandwidth has become the performance bottleneck for memory intensive programs on modern proce...
The performance of memory-bound commercial applications such as databases is limited by increasing m...
Abstract. Traditional parallel programming methodologies for improving performance assume cache-bas...
Accessing the memory efficiently to keep up with the data processing rate is a well known problem in...
In the last three years, GPUs have increasingly been used for general purpose applications instead ...
Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been impleme...