Efficient 3D Stencil Computations Using CUDA

Marcin Krotkiewski, Marcin Dabrowski

Publication date

January 2011

Abstract

We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs. A new method of reading the data from global memory into the shared memory of thread blocks is developed. The method avoids conditional statements and requires only two coalesced instructions to load the tile data with the halo. Additional optimizations include storing only one XY tile of data at a time in shared memory to lower shared memory requirements, common subexpression elimination to reduce the number of instructions, and software prefetching to overlap arithmetic and memory instructions and improve latency hiding. The efficiency of our implementation is analyzed using a simple stencil memory footprint model that takes into accou...
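For readers unfamiliar with the term, the following is a minimal CPU sketch of what a 7-point stencil computes: each interior grid point is updated from itself and its six axis-aligned neighbours. It is not the authors' GPU implementation and uses none of the optimizations listed in the abstract. The function name `stencil7`, the two-coefficient form (`c0` for the centre, `c1` for the neighbours), the x-fastest memory layout, and the choice to copy boundary points unchanged are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not from the paper): applies a 7-point
// stencil on the CPU. Assumed layout: index = x + nx * (y + ny * z).
std::vector<double> stencil7(const std::vector<double>& in,
                             std::size_t nx, std::size_t ny, std::size_t nz,
                             double c0, double c1) {
    assert(in.size() == nx * ny * nz);
    std::vector<double> out(in);  // assumption: boundary points are copied unchanged
    if (nx < 3 || ny < 3 || nz < 3) return out;  // no interior points to update

    const std::size_t sy = nx;       // stride between neighbours in y
    const std::size_t sz = nx * ny;  // stride between neighbours in z
    for (std::size_t z = 1; z + 1 < nz; ++z) {
        for (std::size_t y = 1; y + 1 < ny; ++y) {
            for (std::size_t x = 1; x + 1 < nx; ++x) {
                const std::size_t i = x + sy * y + sz * z;
                // Centre point plus the six neighbours along x, y and z.
                out[i] = c0 * in[i]
                       + c1 * (in[i - 1]  + in[i + 1]
                             + in[i - sy] + in[i + sy]
                             + in[i - sz] + in[i + sz]);
            }
        }
    }
    return out;
}
```

With `c0 = -6` and `c1 = 1` this form is the standard second-order discrete Laplacian on a unit-spaced grid. A 27-point stencil would extend the neighbour sum to the full 3x3x3 cube around each point.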

