Conventional processors use a fully associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to the capacities and bandwidths required by wide-issue, large-window processors. In this work, we improve SQ scalability by implementing store-load forwarding using speculative indexed access rather than associative search. Our design uses prediction to identify the single SQ entry from which each dynamic load is most likely to forward. When a load executes, it obtains its value either from the predicted SQ entry (if the address of the entry matches the load address) or from the data cache (otherwise). A forwarding misprediction, detected by pre-commit filtered load re-execution, results in a pipelin...
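The load path described in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the names IndexedStoreQueue, execute_load and verify_at_commit are hypothetical, the predictor is reduced to an index passed in by the caller, the data cache is a plain dictionary, and the filtering step of re-execution is omitted. What follows a detected misprediction is not modeled, because the abstract is cut off at that point.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class StoreEntry:
    addr: int
    value: int


class IndexedStoreQueue:
    """Store queue read by index only; there is no associative address search."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[Optional[StoreEntry]] = [None] * capacity
        self.tail = 0

    def allocate(self, addr: int, value: int) -> int:
        """Place a store in the next slot (wrapping) and return its index."""
        idx = self.tail % self.capacity
        self.entries[idx] = StoreEntry(addr, value)
        self.tail += 1
        return idx

    def read(self, idx: int) -> Optional[StoreEntry]:
        return self.entries[idx]


def execute_load(addr: int, predicted_idx: Optional[int],
                 sq: IndexedStoreQueue,
                 dcache: Dict[int, int]) -> Tuple[int, bool]:
    """Return (value, forwarded).

    The load forwards from the single predicted SQ entry only if that
    entry's address matches the load address; otherwise it reads the
    data cache (modeled here as a dict, an assumption of this sketch).
    """
    if predicted_idx is not None:
        entry = sq.read(predicted_idx)
        if entry is not None and entry.addr == addr:
            return entry.value, True
    return dcache.get(addr, 0), False


def verify_at_commit(addr: int, speculative_value: int,
                     committed_memory: Dict[int, int]) -> bool:
    """Unfiltered stand-in for pre-commit load re-execution.

    Re-reads memory after all older stores have committed and compares
    the result with the value the load obtained speculatively. False
    means a forwarding misprediction was detected.
    """
    return committed_memory.get(addr, 0) == speculative_value
```

In this model, a load whose predicted entry holds a different address silently falls back to the cache, and a load that should have forwarded but was predicted not to is caught only by the commit-time check.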
This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load co...
As FPGAs continue to increase in size, it becomes increasingly feasible and desirable to bu...
Various memory consistency model implementations (e.g., x86, SPARC) willfully allow a core to see it...
A store queue (SQ) is a critical component of the load execution machinery. High ILP processors requ...
A store queue (SQ) is a critical component of the load execution machinery. High ILP processors requ...
Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory schedul...
Conventional dynamically scheduled processors often use fully associative structures named load/stor...
In an out-of-order core, the load queue (LQ), the store queue (SQ), and the store buffer (SB) are re...
Speculative parallelization (SP) enables a processor to extract multiple threads from a single seque...
Conventional superscalar processors usually contain a large CAM-based LSQ (load/store queue) with poor...
The load-store unit is a performance-critical component of a dynamically-scheduled processor. It is ...
Because they are based on large content-addressable memories, load-store queues (LSQs) present imple...
Efficient data supply to the processor is one of the keys to achieving high performance. However, ...
Data speculation refers to the execution of an instruction before some logically preceding instruc...