The access latency of branch predictors is a well-known problem of fetch engine design. Prediction overriding techniques are a commonly accepted way to overcome this problem. However, prediction overriding requires a complex recovery mechanism to discard the wrong speculative work based on overridden predictions. In this paper, we show that stream and trace predictors, which use long basic prediction units, can tolerate access latency without needing overriding, thus reducing fetch engine complexity. We show that both the stream fetch engine and the trace cache architecture, without overriding, outperform other efficient fetch engines, such as an EV8-like fetch architecture or the FTB fetch engine, even when the latter do use overriding.
Peer Reviewed
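As a rough illustration of why long prediction units can hide predictor latency, the following toy model estimates sustained fetch bandwidth. It is a simplified sketch and not the evaluation method of the paper: it assumes a non-pipelined predictor whose next access overlaps with the fetch of the current unit, and the function name `fetch_rate` and all numbers are hypothetical.

```python
import math


def fetch_rate(unit_len, fetch_width, predictor_latency):
    """Sustained instructions per cycle under a toy model (assumption).

    unit_len          -- instructions covered by one prediction
    fetch_width       -- instructions the fetch engine can deliver per cycle
    predictor_latency -- cycles needed to produce one prediction
    """
    # Cycles spent fetching the instructions of one prediction unit.
    cycles_to_fetch = math.ceil(unit_len / fetch_width)
    # The next prediction is computed while the current unit is fetched,
    # so the slower of the two activities sets the pace.
    cycles_per_unit = max(cycles_to_fetch, predictor_latency)
    return unit_len / cycles_per_unit


# Hypothetical numbers: 8-wide fetch, 3-cycle predictor.
short_unit = fetch_rate(unit_len=6, fetch_width=8, predictor_latency=3)
long_unit = fetch_rate(unit_len=24, fetch_width=8, predictor_latency=3)
print(short_unit, long_unit)
```

Under these assumed numbers, a short unit leaves the fetch engine idle while it waits for the next prediction, whereas a unit long enough to take three cycles to fetch covers the predictor latency completely.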
Journal Article. The speed gap between processors and memory system is becoming the performance bottle...
A basic rule in computer architecture is that a processor cannot execute an application faster than ...
The stream fetch engine is a high-performance fetch architecture based on the concept of an instruct...
This work presents several techniques for enlarging instruction streams. We define a stream as a sequenc...
Fetch engine performance is seriously limited by the branch prediction table access latency. This fa...
Fetch performance is a very important factor because it effectively limits the overall processor per...
The next stream predictor is an accurate branch predictor that provides stream level sequencing. Eve...
Modern microprocessors employ increasingly complicated branch predictors to achieve instruction fetc...
Executing multiple threads has proved to be an effective solution to partially hide latenc...
Future processors combining out-of-order execution with aggressive speculation techniques will need ...
L1 instruction-cache misses pose a critical performance bottleneck in commercial server workloads. C...
A sequence of branch instructions in the dynamic instruction stream forms a branch sequence if at mo...
Hard-to-predict branches depending on long-latency cache-misses have been recognized as a major perf...