High clock frequencies combined with the deep pipelining employed in many state-of-the-art processors have forced cache hit accesses to become multi-cycle operations. For many programs, untolerated load latencies account for a significant portion of total execution time. In this paper, we present a mechanism called the Code Coalescing Unit (CCU) that can identify and eliminate several load operations at run time. The multi-cycle load operations are converted to register read operations with zero latency. Our approach uses a special buffer, called the Store Register Rename Buffer (SRRB), to hold the addresses and data of stores. Subsequent loads of that data are avoided by the CCU by appropriate micro-architectural register ren...
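The idea of turning a load into a register read can be illustrated with a small software model. The sketch below is not the CCU or SRRB design from the paper, whose details are not given here; it is a minimal toy under stated assumptions: the buffer maps a store address to the physical register that held the stored data, physical registers are written once and stay live while referenced, and eviction is FIFO with a small fixed capacity. The names `StoreRegisterRenameBuffer`, `record_store`, `lookup_load`, and `rename_loads` are hypothetical.

```python
from collections import OrderedDict


class StoreRegisterRenameBuffer:
    """Toy model (assumption): store address -> physical register holding the data."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as FIFO age

    def record_store(self, address, phys_reg):
        # A newer store to the same address supersedes the older entry.
        self.entries.pop(address, None)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (FIFO is an assumption)
        self.entries[address] = phys_reg

    def lookup_load(self, address):
        # Physical register to rename to, or None if the load must access the cache.
        return self.entries.get(address)


def rename_loads(trace, capacity=4):
    """Walk a trace of ("store", addr, src_reg) / ("load", addr, dest_reg) tuples.

    Returns (rename_map, cache_loads): loads that hit in the buffer are mapped
    to the register that supplied the store data; the rest still go to the cache.
    """
    srrb = StoreRegisterRenameBuffer(capacity)
    rename_map = {}
    cache_loads = []
    for op, addr, reg in trace:
        if op == "store":
            srrb.record_store(addr, reg)
        elif op == "load":
            hit = srrb.lookup_load(addr)
            if hit is not None:
                rename_map[reg] = hit  # load becomes a register read
            else:
                cache_loads.append((addr, reg))
    return rename_map, cache_loads


trace = [
    ("store", 0x100, "p1"),
    ("load", 0x100, "p7"),   # hits: p7 can be renamed to p1
    ("load", 0x200, "p8"),   # no earlier store: goes to the cache
    ("store", 0x100, "p3"),  # newer store supersedes p1
    ("load", 0x100, "p9"),   # hits the newer entry
]
rename_map, cache_loads = rename_loads(trace)
```

In this model a load whose address matches a buffered store never touches the cache; a real implementation would also have to handle partial-width accesses, speculation, and register lifetime, none of which are modeled here.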
The register file is one of the most critical datapath components limiting the number of threads tha...
Untolerated load instruction latencies often have a significant impact on overall program performanc...
In a wide superscalar processor, the amount of time it takes to execute an application depends on th...
By exploiting fine grain parallelism, superscalar processors can potentially increase the performanc...
By exploiting fine grain parallelism, superscalar processors can potentially increase the performance ...
The performance of the memory hierarchy has become one of the most critical elements in the performa...
In order to improve performance, future parallel systems will continue to increase the processing po...
Wide, deep pipelines need many physical registers to hold the results of in-flight instructions. Sim...
To reduce the average time needed to perform a read or a write access in a multiprocessor, a cache i...
In this paper, we present a novel mechanism that implements register renaming, dynamic speculation a...
A processor designer may wish for an implementation to support multiple register contexts for severa...
This paper presents a novel compiler directed technique to reduce the register pressure and power of...
Modern processors and compilers hide long memory latencies through non-blocking loads or explicit so...
Execution efficiency of memory instructions remains critically important. To this end, a plethora of...
The reorder buffer and register file of a modern superscalar processor are both critical components ...