A store queue (SQ) is a critical component of the load execution machinery. High-ILP processors require high load execution bandwidth, but providing high-bandwidth SQ access is difficult. Address banking, which works well for caches, conflicts with the age ordering that the SQ requires, and multi-porting exacerbates the latency of the associative searches that load execution requires. In this paper, we present a new high-bandwidth load-store unit design that exploits the predictability of forwarding behavior. To start with, a simple predictor filters loads that are not likely to require forwarding from accessing the SQ, enabling a reduction in the number of associative ports. A subset of the loads that do not access the SQ are re-executed...
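The filtering idea in the abstract above can be illustrated with a small sketch. The abstract does not specify the predictor's organization, so everything here is an assumption made for illustration: the class name `ForwardingPredictor`, the PC-indexed table of 2-bit saturating counters, the table size, and the training policy are all hypothetical. How loads that skip the SQ are later verified is outside this sketch.

```python
class ForwardingPredictor:
    """Hypothetical PC-indexed table of 2-bit saturating counters.

    A counter at or above the threshold means "this load has recently
    forwarded from a store, so let it search the SQ". This is an
    illustrative assumption, not the design from the paper.
    """

    def __init__(self, entries=1024, threshold=2):
        self.entries = entries
        self.threshold = threshold
        self.counters = [0] * entries  # all loads start as "no forwarding"

    def _index(self, load_pc):
        # Drop the low two bits (assumes 4-byte aligned instructions).
        return (load_pc >> 2) % self.entries

    def should_search_sq(self, load_pc):
        """Return True if the load is predicted to need forwarding."""
        return self.counters[self._index(load_pc)] >= self.threshold

    def train(self, load_pc, forwarded):
        """Update the counter once the load's true behavior is known."""
        i = self._index(load_pc)
        if forwarded:
            self.counters[i] = 3  # saturate immediately on a forwarding event
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = ForwardingPredictor()
pc = 0x400
first_guess = predictor.should_search_sq(pc)   # untrained load skips the SQ
predictor.train(pc, forwarded=True)
second_guess = predictor.should_search_sq(pc)  # now predicted to forward
```

Saturating on a forwarding event and decaying slowly afterwards biases the sketch toward searching the SQ, on the assumption that wrongly skipping the SQ costs more than an unnecessary search.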
Multicore processors have emerged as a powerful platform on which to efficiently exploit thread-leve...
Virtually all processors today employ a store buffer (SB) to hide store latency. However, when the s...
The performance gap between processor and memory continues to remain a major performance bottleneck ...
Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding....
Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory schedul...
The load-store queue (LQ-SQ) of modern superscalar processors is responsible for keeping the order of...
The load-store unit is a performance critical component of a dynamically-scheduled processor. It is ...
Conventional dynamically scheduled processors often use fully associative structures named load/stor...
In an out-of-order core, the load queue (LQ), the store queue (SQ), and the store buffer (SB) are re...
Because they are based on large content-addressable memories, load-store queues (LSQs) present imple...
This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load co...
This paper describes several methods for improving the scalability of memory disambiguation hardware...
Conventional superscalar processors usually contain large CAM-based LSQ (load/store queue) with poor...
In most modern processor designs, the HW dedicated to store data and instructions (memory hierarchy)...