Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly shared L2 cache, which has long access latency, while the in-core locality that is crucial for performance is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these is...
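To make the notion of inter-CTA locality concrete, here is a minimal sketch (not taken from the paper above) assuming a hypothetical 1D 3-point stencil: each CTA (thread block) processes a contiguous tile of elements but also reads halo elements on each side, so adjacent CTAs touch overlapping global-memory addresses. The block size and stencil radius below are illustrative assumptions.

```python
# Illustration of inter-CTA locality (hypothetical stencil workload).
# Each CTA of BLOCK threads produces BLOCK output elements; with a
# stencil of radius RADIUS it also reads RADIUS halo elements on each
# side, so neighboring CTAs load some of the same global addresses.

BLOCK = 256   # threads (and output elements) per CTA; assumed size
RADIUS = 1    # stencil radius; assumed

def cta_footprint(cta_id):
    """Global element indices read by one CTA, including its halos."""
    start = cta_id * BLOCK
    return set(range(start - RADIUS, start + BLOCK + RADIUS))

# Addresses touched by both CTA 0 and CTA 1: the shared halo region.
shared = cta_footprint(0) & cta_footprint(1)
print(sorted(shared))  # → [255, 256]
```

If the scheduler places such neighboring CTAs on the same SM, the second CTA can hit in L1 on the halo lines the first one already fetched; dispersing them across SMs pushes that reuse out to the shared L2.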
Graphics processing units (GPUs) have become ubiquitous for general purpose applications due to thei...
Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threa...
Current GPU computing models support a mixture of coherent and incoherent classes of memory operatio...
Traditionally, GPUs only had programmer-managed caches. The advent of hardware-managed caches accele...
The massive parallelism provided by general-purpose GPUs (GPGPUs) possessing numerous compute thread...
Massively parallel processing devices, like Graphics Processing Units (GPUs), have the ability to ac...
Graphics Processing Units (GPUs) run thousands of parallel threads and achieve high Memory Level Par...
The diversity of workloads drives studies to use GPUs more effectively to overcome the limited memory...
Abstract—With the SIMT execution model, GPUs can hide memory latency through massive multithreading ...
Data-intensive applications put immense strain on the memory systems of Graphics Processing Units (G...
Abstract—On-chip caches are commonly used in computer systems to hide long off-chip memory access la...
GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher ...
As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. C...
2018-02-23: Graphics Processing Units (GPUs) are designed primarily to execute multimedia and game re...