In stencil-based parallel applications, communications represent the main overhead, especially when targeting a fine-grained parallelization in order to reduce the completion time. Techniques that minimize the number and the impact of communications are therefore clearly relevant. In the literature, the best optimization reduces the number of communications per step from 3^dim, as featured by a naive implementation, to 2*dim, where dim is the number of domain dimensions. To break this bound, in this paper we introduce and formally prove Q-transformations, for stencils whose data dependencies can be expressed as geometric affine translations. Q-transformations, based on data dependency orientations through space translations, lowers t...
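The communication counts quoted in the abstract above can be sketched numerically. This is a hypothetical illustration, not code from the paper; the function names are assumptions:

```python
# Per-step communication counts for a d-dimensional stencil whose
# domain is partitioned into equally sized blocks, one per process.

def naive_comms(dim: int) -> int:
    # A naive implementation exchanges halos with every adjacent
    # block, diagonals included: the 3^dim - 1 blocks surrounding
    # the local one (the Moore neighborhood).
    return 3 ** dim - 1

def axis_aligned_comms(dim: int) -> int:
    # The standard optimization forwards corner/edge data through
    # face-adjacent neighbors, so each process needs only two
    # communications per dimension (e.g. north/south, east/west).
    return 2 * dim

# In 2D: 8 diagonal-inclusive exchanges versus 4 axis-aligned ones.
print(naive_comms(2), axis_aligned_comms(2))
```

Q-transformations, as the abstract states, aim to go below the 2*dim bound for stencils whose dependencies are affine translations.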
In recent years, the use of accelerators in conjunction with CPUs, known as heterogeneous computing,...
Application codes reliably achieve performance far less than the advertised capabilities of existing...
Data reduction is a fundamental operation of parallel computing. We derive lower bounds on communica...
This paper describes a compiler transformation on stencil operators that automatically converts a st...
This paper describes a new technique for optimizing serial and parallel stencil- and stencil-like op...
Minimizing communications when mapping affine loop nests onto distributed memory parallel computers ...
Minimizing communication overhead when mapping affine loop nests onto distributed memory parallel co...
In this thesis, we introduce a new optimization theory for stencil-based applications which is cente...
Interprocessor communication often dominates the runtime of large matrix compu...
This paper proposes tiling techniques based on data dependencies and not on code structur...