Modern data center applications have deep software stacks, with instruction footprints that are orders of magnitude larger than typical instruction cache (I-cache) sizes. To efficiently prefetch instructions into the I-cache despite large application footprints, modern server-class processors implement a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP). In this work, we first characterize the limitations of a decoupled frontend processor with FDIP and find that FDIP suffers from significant Branch Target Buffer (BTB) misses. We also find that existing techniques (e.g., stream prefetchers and predecoders) are unable to mitigate these misses, as they rely on an incomplete understanding of a program's branching behavior.
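To make the failure mode concrete, the following toy model (an illustrative sketch, not the paper's implementation; all names such as `Btb` and `fdip_prefetch` are hypothetical) shows how an FDIP-style prefetcher walks the predicted fetch stream ahead of the fetch unit and why a BTB miss derails it: without a target for a taken branch, the prefetcher can only fall through sequentially, prefetching the wrong path.

```python
class Btb:
    """Toy Branch Target Buffer: maps branch PCs to predicted taken targets."""
    def __init__(self, entries):
        self.entries = dict(entries)

    def lookup(self, pc):
        return self.entries.get(pc)  # None models a BTB miss


def fdip_prefetch(start_pc, btb, steps):
    """Walk the predicted fetch stream ahead of the fetch unit,
    collecting 64-byte I-cache line addresses to prefetch.

    On a BTB miss at a taken branch, the prefetcher has no target to
    redirect to, so it falls through to the next sequential address,
    i.e., it keeps prefetching down the (possibly wrong) fall-through path.
    """
    pc, lines = start_pc, []
    for _ in range(steps):
        lines.append(pc & ~0x3F)  # align to 64-byte cache line
        target = btb.lookup(pc)
        pc = target if target is not None else pc + 4  # miss -> sequential
    return lines


# Hypothetical scenario: the taken branch at 0x100 -> 0x800 hits in the
# BTB, so FDIP follows it; a second taken branch at 0x800 -> 0x2000 is
# absent from the BTB, so the walk stays sequential and the real
# control-flow path at 0x2000 is never prefetched.
btb = Btb({0x100: 0x800})
print([hex(line) for line in fdip_prefetch(0x100, btb, 3)])
```

The point of the sketch is that the prefetcher's reach is bounded by BTB coverage: every branch missing from the BTB silently truncates the useful prefetch stream, which is the behavior the characterization above identifies and which stream prefetchers and predecoders cannot repair without full knowledge of the program's branch targets.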