Thread block compaction for efficient SIMT control flow

Wilson W. L. Fung
Tor M. Aamodt

Publication date

January 2011

DOI

10.1109/hpca.2011.5749714

Abstract

Manycore accelerators such as graphics processor units (GPUs) organize processing units into single-instruction, multiple-data "cores" to improve throughput per unit hardware cost. Programming models for these accelerators encourage applications to run kernels with large groups of parallel scalar threads. The hardware groups these threads into warps/wavefronts and executes them in lockstep, dubbed single-instruction, multiple-thread (SIMT) by NVIDIA. While current GPUs employ a per-warp (or per-wavefront) stack to manage divergent control flow, it incurs decreased efficiency for applications with nested, data-dependent control flow. In this paper, we propose and evaluate the benefits of extending the sharing of resources in a block of warp...
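The efficiency loss the abstract attributes to nested, data-dependent control flow can be illustrated with a toy model. The sketch below is not the paper's mechanism or its evaluation method. It assumes that a warp executes each distinct control-flow path serially at full SIMD width, and that every path has a fixed instruction count. The function name `simd_efficiency` and the path labels are hypothetical.

```python
def simd_efficiency(thread_paths, path_lengths):
    """Toy estimate of SIMD lane utilization for one warp.

    thread_paths: one path label per thread in the warp (hypothetical labels).
    path_lengths: instruction count of each path (assumed fixed per path).
    Assumption: the warp runs every distinct path one after another,
    issuing each instruction across the full warp width with inactive
    lanes masked off.
    """
    width = len(thread_paths)
    # Lane-instructions that do useful work: each thread runs only its own path.
    useful = sum(path_lengths[p] for p in thread_paths)
    # Instructions issued by the warp: every path that at least one thread takes.
    issued = sum(path_lengths[p] for p in set(thread_paths))
    return useful / (width * issued)


lengths = {"A": 10, "B": 10, "C": 10, "D": 10}

uniform = simd_efficiency(["A"] * 32, lengths)               # no divergence
two_way = simd_efficiency(["A"] * 16 + ["B"] * 16, lengths)  # one divergent branch
four_way = simd_efficiency(
    ["A"] * 8 + ["B"] * 8 + ["C"] * 8 + ["D"] * 8, lengths
)                                                            # nested divergence

print(uniform, two_way, four_way)
```

Under these assumptions, utilization halves with each additional level of evenly split divergence, which is the kind of loss the abstract describes for a per-warp stack.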
