Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting way to address this is software prefetching, where special non-blocking loads are used to bring data into the cache hierarchy just before it is required. However, such prefetches are difficult to insert in a way that actually improves performance, and techniques for automatic insertion are currently limited. This paper develops a novel compiler pass to automatically generate software prefetches for indirect memory accesses, a special class of irregular memory accesses often seen in high-performance workloads. We evaluate this across a wide set of systems, all of which benefit from the technique. We then evaluate the extent to which good prefetch instruction...
As the gap between processor and memory speeds widens, program performance is increasingly dependent...
Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen ...
Modern processors and compilers hide long memory latencies through non-blocking loads or explicit so...
A major performance limiter in modern processors is the long latencies caused by data cache misses. ...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/19...
Ever increasing memory latencies and deeper pipelines push memory farther from the processor. Prefet...
Despite large caches, main-memory access latencies still cause significant performance losses in man...
Despite rapid increases in CPU performance, the primary obstacles to achieving higher performance in...
Instruction cache miss latency is becoming an increasingly important performance bottleneck, especia...
Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of s...
Indirect memory accesses have irregular access patterns that limit the performance of conventional s...
Modern processors apply sophisticated techniques, such as deep cache hierarchies and hardware prefet...
Software prefetching and locality optimizations are techniques for overcoming the speed gap between ...