Data-intensive applications put immense strain on the memory systems of Graphics Processing Units (GPUs). To meet this demand, GPU memory systems distribute requests across independent units and provide high bandwidth by servicing requests (mostly) in parallel. We find that this strategy breaks down for shared data structures because the shared Last-Level Cache (LLC) organization used by contemporary GPUs stores shared data in a single LLC slice. Requests for shared data are hence serialized, so data-intensive applications do not receive the bandwidth they require. A private LLC organization can provide high bandwidth, but it is often undesirable since it significantly reduces the effective LLC capacity. In this work, we pr...
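The serialization effect described in this abstract can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not the paper's mechanism or any real GPU's mapping: the line size, slice count, modulo-interleaved `shared_llc_slice` function, and the one-request-per-slice-per-cycle rule are all hypothetical choices made for the example.

```python
from collections import Counter

LINE_BYTES = 128   # assumed cache-line size (hypothetical)
NUM_SLICES = 8     # assumed number of LLC slices (hypothetical)


def shared_llc_slice(addr: int) -> int:
    """Hypothetical address-interleaved mapping: line index modulo slice count."""
    return (addr // LINE_BYTES) % NUM_SLICES


def service_cycles(slices: list[int]) -> int:
    """Assume each slice serves one request per cycle; the busiest slice sets the time."""
    return max(Counter(slices).values())


# Eight SMs each read a different, consecutive line: requests spread over all slices.
distinct_lines = [sm * LINE_BYTES for sm in range(8)]
# Eight SMs all read the same line: every request maps to the same slice.
same_line = [0x4000] * 8

spread = service_cycles([shared_llc_slice(a) for a in distinct_lines])
serialized = service_cycles([shared_llc_slice(a) for a in same_line])
# Private LLC organization in this toy model: SM i always uses its own slice i.
private_llc = service_cycles(list(range(8)))

print(spread, serialized, private_llc)
```

In this model the private organization removes the serialization only because every SM keeps its own copy of the shared line, which is the capacity cost the abstract refers to.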
The last-level cache (LLC) is one of the GPU’s main shared resources that contributes to improving per...
Next generation multicores will process massive data with varying degrees of locality. Harnessing on-...
Graphics Processing Units (GPUs) have been shown to be effective at achieving large speedups over co...
Emerging GPU applications exhibit increasingly high computation demands, which have led GPU manufactur...
Current GPU computing models support a mixture of coherent and incoherent classes of memory operatio...
Traditionally, GPUs only had programmer-managed caches. The advent of hardware-managed caches accele...
On-chip caches are commonly used in computer systems to hide long off-chip memory access la...
Heterogeneous systems are ubiquitous in the field of High-Performance Computing (HPC). Graphics pro...
The reply network is a severe performance bottleneck in General-Purpose Graphics Processing Units (GP...
This paper presents novel cache optimizations for massively parallel, throughput-oriented architectu...
Heterogeneous multicore processors that take full advantage of CPUs and GPUs within the same chip ra...
General-purpose Graphics Processing Units (GPGPUs) have shown enormous promise in enabling high thro...
To match the increasing computational demands of GPGPU applications and to improve peak compute thro...
Caches are designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is ...