This paper presents hyperblocking, or hypertiling, a novel optimization technique that virtually eliminates the self- and cross-interference misses of tiled loop nests while significantly reducing the overhead involved in copying data blocks. In hyperblocking, all arrays used in a tiled loop nest are reorganized at the beginning of the computation so that all data blocks are guaranteed to map into disjoint and contiguous cache regions. Doing so effectively eliminates interference misses in many dense-matrix computations. When data prefetching is used in combination with hyperblocking, the latency of capacity misses can be hidden completely, and given that no cache conflicts can occur, it ...
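
As a rough illustration of the kind of one-time layout reorganization described above (a minimal sketch only: the function to_block_major, the block size B, and the tile-major offset formula are illustrative assumptions, not the paper's actual cache mapping), each array can be copied once into a tile-contiguous layout before the tiled loop nest runs, so that every block occupies a single contiguous region of memory:

#include <stdlib.h>
#include <string.h>

#define N 512   /* problem size (illustrative) */
#define B 32    /* tile edge (illustrative); assumes B divides N */

/* One-time reorganization of a row-major N x N matrix into block-major
 * ("tiled") layout: the B x B tile with tile coordinates (ti, tj) is stored
 * contiguously starting at offset (ti * (N/B) + tj) * B * B, so every tile
 * occupies one contiguous region of memory. */
double *to_block_major(const double *src)
{
    int nt = N / B;                               /* tiles per dimension */
    double *dst = malloc((size_t)N * N * sizeof(double));

    for (int ti = 0; ti < nt; ti++)
        for (int tj = 0; tj < nt; tj++) {
            double *tile = dst + ((size_t)ti * nt + tj) * B * B;
            for (int i = 0; i < B; i++)
                memcpy(tile + i * B,
                       src + (size_t)(ti * B + i) * N + tj * B,
                       B * sizeof(double));
        }
    return dst;
}

/* Address of element (i, j) in the block-major copy. */
static inline double *bm_elem(double *m, int i, int j)
{
    int nt = N / B;
    return m + ((size_t)(i / B) * nt + j / B) * B * B + (i % B) * B + (j % B);
}

The tiled computation then indexes the copies through bm_elem, or walks each tile directly as a contiguous B x B buffer, and the copy cost is paid once per array rather than once per tile visit.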
Increasingly, the main bottleneck limiting performance on emerging multi-core and many-core...
We deal with compiler support for parallelizing perfectly nested loops for coarse-grain distributed ...
Processor speed increases much faster than memory access time decreases. This makes memory accesse...
Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchie...
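
For reference, a minimal sketch of classic blocking applied to matrix multiplication follows; the sizes N and T and the loop order are illustrative textbook choices, not taken from any of the papers collected here:

#define N 512   /* problem size (illustrative) */
#define T 64    /* tile edge, chosen so the tiles touched inside fit in cache */

/* Classic blocked (tiled) matrix multiply, C += A * B.  The three outer
 * loops step over T x T tiles; the three inner loops reuse each tile while
 * it is still cache-resident. */
void matmul_blocked(double C[N][N], const double A[N][N], const double B[N][N])
{
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += a * B[k][j];
                    }
}

Each T x T tile of B is reused for T consecutive values of i while it is cache-resident, which is exactly the reuse that blocking is meant to expose.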
We address the problem of improving the data cache performance of numerical applications -- specif...
On modern computers, the performance of programs is often limited by memory latency rather than by p...
It has been observed that memory access performance can be improved by restructuring data d...
To date, data locality optimizing algorithms mostly aim at providing efficient strategies for blocki...
The effectiveness of the memory hierarchy is critical for the performance of current processors. The...
Tiling is a well-known loop transformation to improve temporal locality of nested loops. Current com...
Since the introduction of cache memories in computer architecture, techniques to improve the data lo...
Loop tiling is an effective optimizing transformation to boost the memory performance of a program, ...
This paper presents a new approach to enabling loop fusion and tiling for arbitrary affine loop nest...
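
As a small illustration of the fusion half of that idea (a toy sketch with made-up array names; it does not demonstrate the arbitrary affine nests such an approach targets), fusing two loops lets the intermediate value be consumed while it is still in cache, after which the single fused loop can be tiled like any other nest:

#define N 1024   /* illustrative size */

/* Before fusion: b is written by the first loop and re-read from memory by
 * the second, costing an extra pass over the array. */
void separate_loops(double *a, double *b, double *c)
{
    for (int i = 0; i < N; i++) b[i] = a[i] + 1.0;
    for (int i = 0; i < N; i++) c[i] = 2.0 * b[i];
}

/* After fusion: each b[i] is produced and consumed while still in a register
 * or the cache, and the fused loop can be strip-mined/tiled as one nest. */
void fused_loop(double *a, double *b, double *c)
{
    for (int i = 0; i < N; i++) {
        b[i] = a[i] + 1.0;
        c[i] = 2.0 * b[i];
    }
}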
This paper presents a new unified method for simultaneously tiling the register and cache ...
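
A rough sketch of what tiling for both levels looks like in practice follows; this is a generic combination of a cache tile with a 2 x 2 register tile (unroll-and-jam), not the unified method the paper describes, and the sizes are illustrative:

#define N  512   /* problem size (illustrative)   */
#define TC 64    /* cache-tile edge (illustrative) */

/* C += A * B with two levels of tiling: the kk/jj loops block for the cache,
 * while i and j are unrolled by 2 so that a 2 x 2 block of C stays in
 * registers across the whole k loop (register tiling / unroll-and-jam). */
void matmul_two_level(double C[N][N], const double A[N][N], const double B[N][N])
{
    for (int kk = 0; kk < N; kk += TC)
        for (int jj = 0; jj < N; jj += TC)
            for (int i = 0; i < N; i += 2)
                for (int j = jj; j < jj + TC; j += 2) {
                    double c00 = C[i][j],     c01 = C[i][j + 1];
                    double c10 = C[i + 1][j], c11 = C[i + 1][j + 1];
                    for (int k = kk; k < kk + TC; k++) {
                        double a0 = A[i][k], a1 = A[i + 1][k];
                        double b0 = B[k][j], b1 = B[k][j + 1];
                        c00 += a0 * b0;  c01 += a0 * b1;
                        c10 += a1 * b0;  c11 += a1 * b1;
                    }
                    C[i][j] = c00;     C[i][j + 1] = c01;
                    C[i + 1][j] = c10; C[i + 1][j + 1] = c11;
                }
}

The four accumulators live in registers for the entire k loop, while the kk/jj blocking keeps the corresponding TC x TC slice of B cache-resident.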
In this paper, an efficient algorithm to implement loop partitioning is introduced and evaluated. We...