The access latency of branch predictors is a well-known problem of fetch engine design. Prediction overriding techniques are commonly accepted as a way to overcome this problem. However, prediction overriding requires a complex recovery mechanism to discard the wrong speculative work based on overridden predictions. In this paper, we show that stream and trace predictors, which use long basic prediction units, can tolerate access latency without needing overriding, thus reducing fetch engine complexity. We show that both the stream fetch engine and the trace cache architecture not using overriding outperform other efficient fetch engines, such as an EV8-like fetch architecture or the FTB fetch engine, even when they do use overriding.
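The overriding scheme the abstract refers to can be pictured with a toy model: a fast but less accurate first-level predictor steers fetch immediately, while a slower, more accurate predictor delivers its result a few cycles later and overrides on disagreement, discarding the instructions fetched in the meantime. The sketch below is purely illustrative; the function names, latency value, and counting scheme are assumptions of this example, not taken from the paper.

```python
# Toy model of prediction overriding (illustrative only, not the paper's
# mechanism): fetch proceeds on a quick first-level prediction, and a slower
# second-level prediction arriving later overrides it when they disagree,
# which forces the speculative fetch work to be flushed.

def simulate_overriding(outcomes, fast_pred, slow_pred, slow_latency=2):
    """Return (correct_final_predictions, override_flushes).

    outcomes:     actual taken/not-taken result for each branch
    fast_pred:    function branch_index -> quick, low-accuracy prediction
    slow_pred:    function branch_index -> accurate prediction, available
                  slow_latency cycles after the quick one
    """
    correct = 0
    flushes = 0
    for i, actual in enumerate(outcomes):
        quick = fast_pred(i)   # drives fetch immediately
        late = slow_pred(i)    # arrives slow_latency cycles later
        if late != quick:
            flushes += 1       # slow_latency cycles of fetched work discarded
        if late == actual:     # the overriding prediction is the one that counts
            correct += 1
    return correct, flushes

# Hypothetical usage: the slow predictor is always right, the fast one
# always predicts taken, so every actually-not-taken branch costs a flush.
outcomes = [True, True, False, True]
correct, flushes = simulate_overriding(outcomes,
                                       fast_pred=lambda i: True,
                                       slow_pred=lambda i: outcomes[i])
```

The recovery cost the paper highlights is the `flushes` term: each override throws away `slow_latency` cycles of fetch, which is exactly the complexity that long prediction units (streams, traces) let the fetch engine avoid.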
During the 1990s Two-level Adaptive Branch Predictors were developed to meet the requirement for acc...
As modern microprocessors employ deeper pipelines and issue multiple instructions per cycle, they ar...
Fetch engine performance is seriously limited by the branch prediction table access latency. This fa...
The stream fetch engine is a high-performance fetch architecture based on the concept of an instruct...
Executing multiple threads has proved to be an effective solution to partially hide latenc...
Hard-to-predict branches depending on long-latency cache-misses have been recognized as a major perf...
This work presents several techniques for enlarging instruction streams. We call a stream a sequenc...
Modern microprocessors employ increasingly complicated branch predictors to achieve instruction fetc...
The next stream predictor is an accurate branch predictor that provides stream level sequencing. Eve...
A sequence of branch instructions in the dynamic instruction stream forms a branch sequence if at mo...
Modern superscalar processors rely on branch predictors to sustain a high instruction fetch throughp...