Row buffer locality is a consequence of programs' inherent spatial locality that the memory system can easily exploit for significant performance gains and power savings. However, as the number of cores on a chip increases, request streams become interleaved more frequently and row buffer locality is lost. Prefetching can help mitigate this effect, but more spatial locality remains to be recovered. In this thesis we propose Prefetch Bundling, a scheme which tags spatially correlated prefetches with information to allow the memory controller to prevent prefetches from becoming interleaved. We evaluate this scheme with a simple scheduling policy and show that it improves the row hit rate by 11%. Unfortunately, the simplicity of the sch...
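The row-hit loss described above can be illustrated with a toy model. This is not the thesis's simulator or scheduling policy; it is a minimal sketch, assuming a single DRAM bank with an open-row policy and two cores streaming sequentially through disjoint regions, to show how interleaving two spatially local request streams destroys row buffer hits.

```python
# Toy model (hypothetical, for illustration only): one DRAM bank with an
# open-row policy. A request is a row hit if it targets the currently
# open row; otherwise the bank must close the row and activate a new one.
ROW_SIZE = 8  # cache blocks per DRAM row (illustrative value)

def row_hit_rate(addresses):
    """Fraction of requests that hit the open row under an open-row policy."""
    open_row, hits = None, 0
    for addr in addresses:
        row = addr // ROW_SIZE
        if row == open_row:
            hits += 1
        open_row = row
    return hits / len(addresses)

# Two cores, each with good spatial locality in its own address region.
stream_a = list(range(0, 16))     # core A: sequential blocks, rows 0-1
stream_b = list(range(100, 116))  # core B: sequential blocks, disjoint rows

# Bundled: spatially correlated requests reach the controller together.
bundled = stream_a + stream_b
# Interleaved: the two streams alternate request by request.
interleaved = [x for pair in zip(stream_a, stream_b) for x in pair]

print(round(row_hit_rate(bundled), 2))      # high hit rate
print(round(row_hit_rate(interleaved), 2))  # every request opens a new row
```

In this sketch the bundled order hits the open row on most requests, while the interleaved order never does, since consecutive requests always target different rows. Keeping correlated prefetches together, as Prefetch Bundling proposes, recovers exactly this kind of lost locality.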
Recent technological advances are such that the gap between processor cycle times and memory cycle t...
Memory access latency is the primary performance bottleneck in modern computer systems. Prefetching...
Modern computer systems spend a substantial fraction of their running time waiting for data from...
While many parallel applications exhibit good spatial locality, other important codes in areas like ...
A well known performance bottleneck in computer architecture is the so-called memory wall. This term...
Abstract—Both on-chip resource contention and off-chip latencies have a significant impact on memor...
Software prefetching and locality optimizations are two techniques for overcoming the speed gap betw...
Software prefetching and locality optimizations are techniques for overcoming the speed gap between ...
Due to shared cache contentions and interconnect delays, data prefetching is more critical in allevi...
The gap between processor and memory speed appears as a serious bottleneck in improving the performa...
Despite large caches, main-memory access latencies still cause significant performance losses in man...
Abstract. Given the increasing gap between processors and memory, prefetching data into cache become...
Software prefetching and locality optimizations are techniques for overcoming the gap between proces...