Increasing cache latency in future processors has a profound performance impact in spite of advanced out-of-order execution techniques. In this paper, we describe an early address resolution mechanism that accurately resolves both regular and irregular load addresses. The basic idea is to build dynamic dependence links from the instruction that updates the base register to its consumer load instructions. Once a new base address becomes available, it triggers calculation of the new load addresses for the dependent loads. Furthermore, the exact cache location of the requested data is predicted from the newly resolved load address. As a result, such a direct load can access the data cache immediately and achieve a zero-cycle load latency. Performanc...
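Below is a minimal software sketch of the dependence-link idea described in this abstract; it is an illustration only, not the authors' hardware design. The structure names, table sizes, and functions (dep_link_t, add_link, base_register_updated) are assumptions made for the example: when the producer of a base register writes its value, every linked load's address is resolved early and an L1 set is predicted from it.

/*
 * Simplified sketch of early address resolution via dependence links.
 * One list of dependent loads is kept per base register; when the base
 * register is written, the new load addresses are computed immediately
 * and a cache set is predicted from each resolved address.
 * All names and sizes here are illustrative assumptions.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_REGS   32
#define MAX_LINKS   4
#define LINE_BYTES 64
#define NUM_SETS   64

typedef struct {
    int      valid;
    int32_t  offset;        /* immediate offset of the dependent load  */
    uint64_t early_addr;    /* address resolved ahead of load issue    */
    unsigned pred_set;      /* predicted L1 set for the direct access  */
} dep_link_t;

static dep_link_t links[NUM_REGS][MAX_LINKS];

/* record that a load "ld rX <- [rBase + offset]" depends on rBase */
static void add_link(int base_reg, int slot, int32_t offset)
{
    links[base_reg][slot].valid  = 1;
    links[base_reg][slot].offset = offset;
}

/* called when the producer instruction writes base_reg: resolve every
 * dependent load address early and predict its cache set */
static void base_register_updated(int base_reg, uint64_t new_value)
{
    for (int i = 0; i < MAX_LINKS; i++) {
        dep_link_t *d = &links[base_reg][i];
        if (!d->valid)
            continue;
        d->early_addr = new_value + d->offset;
        d->pred_set   = (unsigned)((d->early_addr / LINE_BYTES) % NUM_SETS);
    }
}

int main(void)
{
    /* a load at offset 16 from r5 is linked to r5's producer */
    add_link(5, 0, 16);

    /* producer writes r5; the dependent load's address is resolved early */
    base_register_updated(5, 0x1000);

    printf("early addr = 0x%llx, predicted set = %u\n",
           (unsigned long long)links[5][0].early_addr,
           links[5][0].pred_set);
    return 0;
}

In real hardware the links would be established dynamically as instructions are decoded and renamed, rather than registered by hand as in main() above.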
As CPU data requests to the level-one (L1) data cache (DC) can represent as much as 25% of an embedd...
Future multi-core and many-core processors are likely to contain one or more high performance out-of...
For many programs, especially integer codes, untolerated load instruction latencies account for a si...
Two orthogonal hardware techniques, table-based address prediction and early address calculation, fo...
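Since this abstract names table-based address prediction without further visible detail, the following is only a generic sketch of one common table-based scheme (a last-address plus stride table indexed by the load's PC); the entry format and the predict/update functions are assumptions made for illustration, not the cited design.

/*
 * Sketch of a table-based load address predictor: each entry holds the
 * last address and last observed stride for a static load, so the next
 * address can be guessed before the base register is ready.
 */
#include <stdint.h>
#include <stdio.h>

#define TABLE_ENTRIES 256

typedef struct {
    uint64_t tag;        /* load PC tag                      */
    uint64_t last_addr;  /* last address seen for this load  */
    int64_t  stride;     /* last observed address delta      */
    int      valid;
} addr_pred_entry_t;

static addr_pred_entry_t table[TABLE_ENTRIES];

/* predict the next address for a load, before its operands are ready */
static int predict(uint64_t pc, uint64_t *pred_addr)
{
    addr_pred_entry_t *e = &table[pc % TABLE_ENTRIES];
    if (!e->valid || e->tag != pc)
        return 0;                          /* no prediction available */
    *pred_addr = e->last_addr + (uint64_t)e->stride;
    return 1;
}

/* update the table once the real address is known at execute time */
static void update(uint64_t pc, uint64_t real_addr)
{
    addr_pred_entry_t *e = &table[pc % TABLE_ENTRIES];
    if (e->valid && e->tag == pc) {
        e->stride = (int64_t)(real_addr - e->last_addr);
    } else {
        e->tag    = pc;
        e->stride = 0;
        e->valid  = 1;
    }
    e->last_addr = real_addr;
}

int main(void)
{
    uint64_t pc = 0x400100, addr;
    /* a load walking an array with an 8-byte stride */
    update(pc, 0x1000);
    update(pc, 0x1008);
    if (predict(pc, &addr))
        printf("predicted next address: 0x%llx\n",
               (unsigned long long)addr); /* expect 0x1010 */
    return 0;
}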
Data cache misses reduce the performance of wide-issue processors by stalling the data supply to the...
By exploiting fine grain parallelism, superscalar processors can potentially increase the performanc...
Processor performance is directly impacted by the latency of the memory system. As processor core cy...
One major restriction to the performance of out-of-order superscalar processors is the latency of lo...
In this correspondence, we propose design techniques that may significantly simplify the cache acces...
While runahead execution is effective at parallelizing independent long-latency cache misses, it is ...
Untolerated load instruction latencies often have a significant impact on overall program performanc...