We present Decoupled Vector Runahead (DVR), an in-core prefetching technique, executing separately from the main application thread, that exploits massive amounts of memory-level parallelism to improve the performance of applications featuring indirect memory accesses. DVR dynamically infers loop bounds at run-time, recognizing striding loads and vectorizing subsequent instructions that are part of an indirect chain. It proactively issues memory accesses for the resulting loads far into the future, even when the out-of-order core has not yet stalled, bringing their data into the L1 cache and thus providing timely prefetches for the main thread. DVR can adjust the degree of vectorization at run-time, vectorize the same chain of indirect memo...
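To make the targeted access pattern concrete, the following is a minimal software sketch, not the hardware mechanism itself: it shows the kind of striding-load plus dependent indirect-load chain that DVR vectorizes and prefetches ahead of the main thread. The function indirect_sum, the array names idx and data, and the PREFETCH_DISTANCE constant are hypothetical illustrations, and the explicit __builtin_prefetch call only approximates in software the runahead prefetching that DVR performs in hardware.

#include <stddef.h>

/* Hypothetical runahead depth, in loop iterations. */
#define PREFETCH_DISTANCE 64

/* Indirect-access kernel of the form sum += data[idx[i]].
 * idx[i] is a striding load; data[idx[i]] is the dependent load whose
 * address is only known once idx[i] has been fetched, which defeats a
 * conventional stride prefetcher. */
double indirect_sum(const size_t *idx, const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Software stand-in for DVR: run ahead along the striding load,
         * pick up a future index, and issue the dependent load early so
         * its data sits in the L1 cache when the main loop reaches it. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);

        sum += data[idx[i]];
    }
    return sum;
}

In hardware, DVR would instead detect the striding load on idx, infer the loop bound, vectorize the dependent data[idx[i]] loads across many future iterations at once, and issue them without any source-level hints such as the prefetch call above.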
The end of Dennard scaling is expected to shrink the range of DVFS in future nodes, limiting the ene...
Memory-intensive threads can hoard shared resources without making progress on a multithreading p...
This paper presents an experimental study on cache memory designs for vector computers. We use an ex...
The memory wall places a significant limit on performance for many modern workloads. These applicati...
The purpose of this paper is to show that using decoupling techniques in a vector processor, the per...
Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands...
As we approach the end of conventional technology scaling, computer architects are forced to incorpo...
We present Outrider, an architecture for throughput-oriented processors that exploits intra-thread m...
Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting propositi...
The speed gap between processors and the memory system is becoming the performance bottle...
The purpose of this paper is to show that multi-threading techniques can be applied to a vector proc...
This paper describes future execution (FE), a simple hardware-only technique to accelerate individu...
Despite rapid increases in CPU performance, the primary obstacles to achieving higher performance in...