Data prefetching is an effective technique for hiding memory latency and thus bridging the widening processor-memory performance gap. Our previous work presented guided region prefetching (GRP), a hardware/software cooperative prefetching technique that cost-effectively tolerates L2 latencies. Compiler hints improve L2 prefetching accuracy and reduce bus bandwidth consumption compared to hardware-only prefetching. However, some useless prefetches remain and degrade memory performance. This paper first explores a more aggressive GRP scheme that pushes L2 prefetches into the L1 cache, similar to the IBM POWER4 and POWER5 cache designs. This approach yields some additional performance improvement. This work then combines GRP with evict-me,...
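GRP's actual mechanism is compiler-generated region hints consumed by a hardware prefetch engine, which the abstract does not detail. As a minimal sketch of the general idea of compiler-assisted software prefetching, the fragment below uses GCC/Clang's `__builtin_prefetch` to request data a fixed distance ahead of the current access; the function name and the prefetch distance are illustrative assumptions, not part of GRP.

```c
#include <stddef.h>

/* Illustrative sketch only: prefetch a fixed distance ahead while
 * streaming through an array, so the load of a[i] is likely to hit
 * in cache by the time the loop reaches it. */
long sum_with_prefetch(const long *a, size_t n) {
    const size_t dist = 16; /* assumed prefetch distance, in elements */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            /* args: address, rw=0 (read), locality=1 (low temporal reuse) */
            __builtin_prefetch(&a[i + dist], 0, 1);
        }
        sum += a[i];
    }
    return sum;
}
```

The prefetch is a hint: if the line is already resident or the request is useless, the instruction has no architectural effect, which is exactly why inaccurate hints waste bandwidth rather than cause incorrect results.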
A major performance limiter in modern processors is the long latencies caused by data cache misses. ...
As process-scaling trends make the memory system an even more critical bottleneck, the importance of ...
Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting propositi...
Despite large caches, main-memory access latencies still cause significant performance losses in man...
The memory system remains a major performance bottleneck in modern and future architectures. In this...
Ever increasing memory latencies and deeper pipelines push memory farther from the processor. Prefet...
Instruction cache miss latency is becoming an increasingly important performance bottleneck, especia...
In this dissertation, we provide hardware solutions to increase the efficiency of the cache hierarch...
The growing performance gap caused by high processor clock rates and slow DRAM accesses makes cache ...
Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen ...
Processor performance has increased far faster than memories have been able to keep up with, forcing...
Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of s...
This paper describes a new hardware approach to data and instruction prefetching for superscalar pr...
A well known performance bottleneck in computer architecture is the so-called memory wall. This term...