This paper describes future execution (FE), a simple hardware-only technique to accelerate individual program threads running on multicore microprocessors. Our approach uses available idle cores to prefetch important data for the threads executing on the active cores. FE is based on the observation that many cache misses are caused by loads that execute repeatedly and whose address-generating program slices do not change (much) between consecutive executions. To exploit this property, FE dynamically creates a prefetching thread for each active core by simply sending a copy of all committed, register-writing instructions to an otherwise idle core. The key innovation is that on the way to the second core, a value predictor replaces each pred...
In-order microprocessors are increasingly adopted in a variety of multi-core chips due to their adva...
Instruction cache miss latency is becoming an increasingly important performance bottleneck, especia...
Simultaneous Multithreading (SMT) has been proposed for improving processor throughput by overlappin...
This paper proposes a new hardware technique for using one core of a CMP to prefetch data for a thr...
Scaling the performance of applications with little thread-level parallelism is one of the most seri...
Memory access latency is a main bottleneck limiting further improvement of multi-core proces...
Hardly predictable data addresses in many irregular applications have rendered prefetching ineffec...
The era of multi-core processors has begun. These multi-core processors represent a significant shi...
Delinquent instructions are a small number of static instructions that cause most branch prediction ...
The exponentially increasing gap between processors and off-chip memory, as measured in processor cy...
Multicore processors have become ubiquitous in today's computing platforms, extending from smartphon...
Current integration trends embrace the prosperity of single-chip multi-core processors. Although mul...
Pre-execution uses helper threads running in spare hardware contexts to trigger cache misses in fron...
Pre-execution attacks cache misses for which conventional address-prediction driven prefetching is i...
The large latency of memory accesses in modern computer systems is a key obstacle to achieving high ...