Preserving memory locality is a major issue in highly multithreaded architectures such as GPUs. These architectures hide latency by maintaining a large number of threads in flight. As each thread needs to maintain a private working set, all threads collectively put tremendous pressure on on-chip memory arrays, at significant cost in area and power. We show that thread-private data in GPU-like implicit SIMD architectures can be compressed by a factor of up to 16 by taking advantage of correlations between values held by different threads. We propose the Affine Vector Cache, a compressed cache design that complements the first-level cache. Evaluation by simulation on the SDK and Rodinia benchmarks shows that a 32KB L1 cache assisted by a 16KB AV...
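The compression opportunity the abstract describes comes from the fact that, in implicit SIMD execution, the values held by the threads of a warp for a given variable are frequently affine in the lane index: v[i] = base + i * stride (addresses of per-thread array slots, loop indices, and uniform values with stride 0 all fit this pattern). The following C sketch illustrates the idea in software, assuming a 32-thread warp; encoding 32 one-word lane values as a (base, stride) pair yields the 16x ratio mentioned above. The names affine_t, affine_encode, and affine_decode are illustrative, not the paper's hardware design.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WARP_SIZE 32

/* Hypothetical compressed form of one warp-wide register:
   if v[i] == base + i*stride for every lane, 32 words shrink to 2. */
typedef struct {
    int32_t base;   /* value held by lane 0 */
    int32_t stride; /* per-lane increment (0 for uniform values) */
} affine_t;

/* Returns true and fills *out when the lane values form an affine vector. */
static bool affine_encode(const int32_t v[WARP_SIZE], affine_t *out) {
    int32_t stride = v[1] - v[0];
    for (int i = 2; i < WARP_SIZE; i++)
        if (v[i] - v[i - 1] != stride)
            return false;   /* irregular vector: store uncompressed instead */
    out->base = v[0];
    out->stride = stride;
    return true;
}

/* Expands a compressed pair back into the full per-lane vector. */
static void affine_decode(const affine_t *a, int32_t v[WARP_SIZE]) {
    for (int i = 0; i < WARP_SIZE; i++)
        v[i] = a->base + i * a->stride;
}

int main(void) {
    int32_t addr[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; i++)
        addr[i] = 0x1000 + 4 * i;   /* typical per-thread word addresses */

    affine_t a;
    if (affine_encode(addr, &a))
        printf("affine: base=%d stride=%d (32 words -> 2)\n",
               (int)a.base, (int)a.stride);

    int32_t back[WARP_SIZE];
    affine_decode(&a, back);        /* recovers the original lane values */
    return 0;
}

A compressed cache built on this encoding can hold many more warp-wide values per SRAM byte than a conventional cache, which is how a small side structure can enlarge the usable capacity of the L1 it complements.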