3D-stacked memory devices with processing logic can help alleviate the memory bandwidth bottleneck in GPUs. However, for such Near-Data Processing (NDP) memory stacks to be usable across different GPU architectures, the NDP architecture should be standardized. Our proposal enables this standardization by allowing data to be spread across multiple memory stacks, as is the norm in high-performance systems, without requiring an MMU on the NDP stack. The keys to this architecture are the ability to move data between memory stacks as required for computation, and a partitioned execution mechanism that offloads memory-intensive application segments onto the NDP stack and decouples address translation from DRAM accesses. By enhancing this system w...
Graphics processing units (GPUs) have become prevalent in modern computing systems. While their high...
The conventional approach of moving data to the CPU for computation has become a significant perform...
Graphics Processing Units are among the most widely adopted parallel computing engines for modern ap...
Recent technology advances in memory system design, along with 3D stacking, have made near-data proc...
Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin ba...
The end of Dennard scaling has made all systems energy-constrained. For data-intensive app...
As the performance of DRAM devices falls more and more behind computing capabilities, the limitation...
The exponential growth of the dataset size demanded by modern big data applications requires innovat...
While Processing-in-Memory has been investigated for decades, it has not been embraced commercially....
The limitations of DRAM technology in terms of energy consumption and bandwidth pose a serious prob...
A large fraction of MapReduce execution time is spent processing the Map phase, and a large fraction...
For the past two decades, the scaling of main memory has lagged behind the advancement of computation in a...
Over the last decades, a tremendous change toward using information technology in almost every daily...
The cost of transferring data between the off-chip memory system and compute unit is the fundamental...