We introduce a method for improving the cache performance of irregular computations in which data are referenced through indirection arrays defined at run time. Such computations often arise in scientific problems. The method, called Run-Time Reference Clustering (RTRC), is a run-time analog of the compile-time blocking used for dense matrix problems. RTRC uses the data partitioning and remapping techniques that distributed-memory multiprocessor codes employ to minimize interprocessor communication. Remapping each set of local data decreases cache misses in the same way that remapping the global data decreases off-processor references. We demonstrate the applicability and performance of the RTRC technique on several prevalen...
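The remapping idea can be illustrated with a minimal sketch. This is not the RTRC algorithm itself, which relies on data partitioning; as a labeled stand-in it uses a simpler first-touch reordering, so that items referenced close together in time end up close together in memory. The function names (`first_touch_remap`, `apply_order`, `gather_sum`) and the sample data are hypothetical and chosen only for illustration.

```python
def first_touch_remap(index_array, n_items):
    """Renumber items in the order the indirection array first touches them.

    Returns (new_index_array, order), where order[new_id] == old_id.
    Assumption: a first-touch ordering stands in for the partitioner
    that a real reference-clustering scheme would use.
    """
    new_id = [-1] * n_items
    order = []
    for old in index_array:
        if new_id[old] == -1:
            new_id[old] = len(order)
            order.append(old)
    # Items that are never referenced are placed after the touched ones.
    for old in range(n_items):
        if new_id[old] == -1:
            new_id[old] = len(order)
            order.append(old)
    new_index_array = [new_id[old] for old in index_array]
    return new_index_array, order


def apply_order(data, order):
    """Physically move the data so that position new_id holds data[old_id]."""
    return [data[old] for old in order]


def gather_sum(data, index_array):
    """An irregular loop: every access goes through the indirection array."""
    return sum(data[i] for i in index_array)


data = [10.0, 20.0, 30.0, 40.0, 50.0]
index_array = [4, 0, 4, 2, 0]

new_index_array, order = first_touch_remap(index_array, len(data))
new_data = apply_order(data, order)

# The remapped loop computes the same result with more sequential accesses.
assert gather_sum(new_data, new_index_array) == gather_sum(data, index_array)
```

After remapping, the loop walks `new_data` through small, mostly increasing indices instead of jumping across the array, which is the property that reduces cache misses on real hardware. The result of the computation is unchanged because the data and the indirection array are renumbered consistently.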
© 1994 ACM. In the past decade, processor speed has become significantly faster than memory speed. S...
This paper describes a number of optimizations that can be used to support the efficient execution o...
Applications with regular patterns of memory access can experience high levels of cache conflict mis...
With the rapid improvement of processor speed, performance of the memory hierarchy has become the pr...
The most important processor performance bottleneck is the ever-increasing gap between the memory an...
Applications often under-utilize cache space and there are no software locality optimization techniq...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
Despite rapid increases in CPU performance, the primary obstacles to achieving higher performance in...
Exploiting locality of reference is key to realizing high levels of performance on modern p...
High-performance scientific computing relies increasingly on high-level large-scale object-oriented ...
In modern clustering environments where the memory hierarchy has many layers (distributed memory, sh...
This paper describes a technique for improving the data reference locality of parallel programs usi...
Performance tuning, as carried out by compiler designers and application programmers to close the pe...
Hardware trends have produced an increasing disparity between processor speeds and memory access tim...
Measurements of actual supercomputer cache performance have not been previously undertaken. PFC-Sim i...