We demonstrate that data reordering can substantially improve the performance of fine-grained irregular shared-memory benchmarks, on both hardware and software shared-memory systems. In particular, we evaluate two distinct data reordering techniques that seek to co-locate in memory objects that are in close proximity in the physical system modeled by the computation. The effects of these techniques are increased spatial locality and reduced false sharing. We evaluate the effectiveness of the data reordering techniques on a set of five irregular applications from SPLASH-2 and Chaos. We implement both techniques in a small library, allowing us to enable them in an application by adding fewer than 10 lines of code. Our results on one hardware an...
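The general idea of co-locating objects that are physically close can be illustrated with a short sketch. The abstract does not name its two reordering techniques, so the code below swaps in a commonly used stand-in, Morton (Z-order) sorting of 2D positions; it is not the authors' library. The function names `interleave_bits`, `morton_permutation`, and `reorder` are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence, Tuple


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return key


def morton_permutation(
    positions: Sequence[Tuple[float, float]], bits: int = 16
) -> List[int]:
    """Return object indices sorted by the Morton key of their 2D position.

    Assumption: each object has a 2D coordinate in the modeled physical system.
    """
    if not positions:
        return []
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0   # avoid division by zero
    span_y = (max(ys) - min_y) or 1.0
    scale = (1 << bits) - 1             # largest integer grid coordinate

    def key(i: int) -> int:
        gx = int((xs[i] - min_x) / span_x * scale)
        gy = int((ys[i] - min_y) / span_y * scale)
        return interleave_bits(gx, gy, bits)

    return sorted(range(len(positions)), key=key)


def reorder(objects: Sequence, perm: Sequence[int]) -> list:
    """Lay the objects out in the order given by the permutation."""
    return [objects[i] for i in perm]
```

In an application, the permutation would be computed once (or periodically, as objects move) and applied to the object array, so that objects adjacent in the simulated space tend to share cache lines or pages. Any index arrays that refer to the old positions would need to be remapped as well, which this sketch does not do.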
Distributed Shared Memory (DSM) is becoming an accepted abstraction for programming distributed sy...
Irregular applications frequently exhibit poor performance on contemporary computer architectures, i...
Processor speed is increasing much faster than memory access time is improving. This makes memory accesse...
The gap between CPU speed and memory speed in modern computer systems is widening as new generation...
Current system design trends continue to magnify the disparity between processor and memory perform...
Enhancing the match between software executions and hardware features is key to computing efficiency...
The gap between CPU speed and memory speed in modern computer systems is widening as new generations...
In this paper we explore the idea of customizing and reusing loop schedules to improve the scalabili...
An important class of scientific codes accesses memory in an irregular manner. Because irregular acce...
Data locality is a well-recognized requirement for the development of any parallel application, but ...
Software prefetching and locality optimizations are two techniques for overcoming the speed gap bet...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
We have developed compiler algorithms that analyze coarse-grained, explicitly parallel programs and ...
Software prefetching and locality optimizations are two techniques for overcoming the speed gap betw...