This dissertation describes work on the architecture of throughput-oriented accelerator processors. First, we examine the limitations of current accelerator processors and identify an opportunity to enable high throughput while also providing a more general-purpose programming model. To address this opportunity, we present Rigel, a single-chip accelerator architecture with 1024 independent processing cores targeted at a broad class of data- and task-parallel computation. Enabled by the feasibility of large die sizes combined with increasing transistor densities, we show that such an aggressive design can be implemented in today's process technology within acceptable area and power limits. We discuss our motivation for such a design and e...
Accelerators, such as GPUs and Intel Xeon Phis, have become the workhorses of high-performance compu...
To help shrink the programmability-performance efficiency gap, we argue that adaptive runtime syst...
This dissertation presents a novel decoupled latency tolerance technique for 1000-core data parallel...
In this thesis, I describe the evaluation framework for Rigel, a 1024-core single-chip accelerator ...
......Increasing demand for performance on data-intensive parallel workloads has driven the design ...
The Rigel compute accelerator has been developed to explore alternative architectures for massively ...
There is a large, emerging, and commercially relevant class of applications which stands to be enabl...
This thesis describes the efficient design of a future many-core processor that can provide higher p...
Future performance improvements must come from the exploitation of concurrency at all levels. Recen...
Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as th...
The continued miniaturization of the technology node increases not only the chip capacity but also t...
Recent graphics processing units (GPUs) have emerged as a promising platform for general purpose...
The purpose of this paper is to show that multi-threading techniques can be applied to a vector proc...
This work describes a cache architecture and memory model for 1000+ core microprocessors. Our appro...
Parallel programming requires a significant amount of developer effort, and creating optimized paral...