We present a new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors. The emphasis in our scheme is on the effective utilization of slack time and hardware resources not used for the main computation. The scheme suggests a new hardware construct, the Program Progress Graph (PPG), as a simple extension to the Branch Target Buffer (BTB). We use the PPG for implementing a fast pre-program counter, pre-PC, that travels only through memory reference instructions (rather than scanning all the instructions sequentially). In a single clock cycle the pre-PC extracts all the predicted memory references in some future block of instructions, to obtain early data p...
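Since the abstract above is truncated, the following is only a rough software sketch of the pre-PC idea it describes: a pre-PC that hops from one predicted memory-reference instruction directly to the next via PPG entries, issuing prefetches ahead of the architectural PC. All structure names, fields, and addresses here are illustrative assumptions, not the paper's actual hardware interface.

```c
/* Minimal software sketch (not the paper's hardware design) of a pre-PC
 * walking a Program Progress Graph: each hypothetical PPG entry maps a
 * memory-reference instruction to the next predicted memory reference on
 * the predicted control path, so the pre-PC skips all intervening
 * non-memory instructions in a single step. */
#include <stdio.h>

#define PPG_SIZE 8

typedef struct {
    unsigned pc;        /* address of a memory-reference instruction    */
    unsigned next_pc;   /* next predicted memory reference on the path  */
    unsigned data_addr; /* predicted effective address to prefetch      */
} ppg_entry;

/* Toy PPG: a short chain of predicted loads along one predicted path. */
static const ppg_entry ppg[PPG_SIZE] = {
    {0x100, 0x118, 0x8000}, {0x118, 0x140, 0x8040},
    {0x140, 0x15c, 0x8080}, {0x15c, 0x000, 0x80c0},
};

static const ppg_entry *ppg_lookup(unsigned pc) {
    for (int i = 0; i < PPG_SIZE; i++)
        if (ppg[i].pc == pc) return &ppg[i];
    return NULL;
}

/* The pre-PC runs ahead of the real PC, emitting a prefetch for every
 * predicted memory reference within a fixed lookahead block.           */
static void run_pre_pc(unsigned start_pc, int lookahead) {
    unsigned pre_pc = start_pc;
    for (int i = 0; i < lookahead; i++) {
        const ppg_entry *e = ppg_lookup(pre_pc);
        if (!e) break;  /* this part of the path is not recorded yet    */
        printf("prefetch 0x%x (for insn 0x%x)\n", e->data_addr, e->pc);
        pre_pc = e->next_pc;  /* jump straight to the next memory ref   */
    }
}

int main(void) {
    run_pre_pc(0x100, 4);
    return 0;
}
```

In hardware, the per-iteration lookup and pointer chase above would be what the abstract describes as happening within a single clock cycle over a future block of instructions; the loop here is purely an expository stand-in.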
Despite rapid increases in CPU performance, the primary obstacles to achieving higher performance in...
With the continuing technological trend of ever cheaper and larger memory, most data sets in databas...
Despite large caches, main-memory access latencies still cause significant performance losses in man...
This paper describes a new hardware approach to data and instruction prefetching for superscalar pr...
The large latency of memory accesses in modern computer systems is a key obstacle to achieving high ...
Given the increasing gap between processors and memory, prefetching data into cache become...
Recent technological advances are such that the gap between processor cycle times and memory cycle t...
The performance of superscalar processors is more sensitive to the memory system delay than their si...
This thesis considers two approaches to the design of high-performance computers. In a single pro...
Instruction cache miss latency is becoming an increasingly important performance bottleneck, especia...
Scaling the performance of applications with little thread-level parallelism is one of the most seri...
A well known performance bottleneck in computer architecture is the so-called memory wall. This term...
Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting propositi...
A major performance limiter in modern processors is the long latencies caused by data cache misses. ...
Conventional cache prefetching approaches can be either hardware-based, generally by using a one-blo...