Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this ...
The ever increasing sizes of on-chip caches and the growing domination of wire delay...
Many hardware optimizations rely on collecting information about program behavior at runtime. This i...
Approximate computing recognizes that many applications can tolerate inexactness. These applications...
Long-latency load requests continue to limit the performance of high-performan...
The “Memory Wall”, the vast gulf between processor execution speed and memory latency, has led to th...
A well known performance bottleneck in computer architecture is the so-called memory wall. This term...
With off-chip memory access taking 100's of processor cycles, getting data to the processor in a tim...
This work addresses the problem of the increasing performance disparity between the microprocessor a...
Recent technology advances enabled computerized services which have proliferated leading to a tremen...
As the degree of instruction-level parallelism in superscalar architectures increases, the gap betwe...
Despite a decade of research demonstrating its efficacy, address-correlated prefetching has never be...
Memory latency is a key bottleneck for many programs. Caching and prefetching are two popular hardwa...
The speed gap between processors and memory system is becoming the performance bottle...
Conventional cache prefetching approaches can be either hardware-based, generally by using a one-blo...
In this dissertation, we provide hardware solutions to increase the efficiency of the cache hierarch...