This paper presents a new approach to enabling loop fusion and tiling for arbitrary affine loop nests. Given a set of multiple loop nests, we present techniques that automatically eliminate all the fusion-preventing dependences by means of loop tiling and array copying. Applying our techniques iteratively to multiple loop nests yields a single loop nest that can be tiled for cache locality. Our approach handles LU, QR, Cholesky and Jacobi in a unified framework. Our experimental evaluation on an SGI Octane2 system shows that the benefit from the significantly reduced L1 and L2 cache misses more than offsets the branching and loop control overhead introduced by our approach.
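To make the notion of a fusion-preventing dependence concrete, the following is a minimal sketch on a one-dimensional Jacobi-style sweep. It is an illustration only, not the algorithm of the paper: the function names, the one-dimensional stencil, and the use of a single saved scalar in place of a copied array region are all assumptions made for the example, and tiling is omitted.

```python
def jacobi_unfused(a):
    """Reference version: two separate loops (stencil sweep, then copy back)."""
    n = len(a)
    a = list(a)
    b = list(a)
    for i in range(1, n - 1):
        b[i] = (a[i - 1] + a[i + 1]) / 2.0
    for i in range(1, n - 1):
        a[i] = b[i]
    return a


def jacobi_fused_naive(a):
    """Illegal fusion: a[i] is overwritten before iteration i+1 reads it as a[i-1]."""
    n = len(a)
    a = list(a)
    b = list(a)
    for i in range(1, n - 1):
        b[i] = (a[i - 1] + a[i + 1]) / 2.0
        a[i] = b[i]  # destroys the old a[i] that the next iteration still needs
    return a


def jacobi_fused_with_copy(a):
    """Fused loop made legal by keeping a copy of the one old value still needed."""
    n = len(a)
    a = list(a)
    prev_old = a[0]  # old (pre-sweep) value of a[i-1]
    for i in range(1, n - 1):
        cur_old = a[i]  # save the old value before overwriting it
        a[i] = (prev_old + a[i + 1]) / 2.0
        prev_old = cur_old
    return a
```

In this sketch the naive fusion produces a different result from the two-loop reference, while the version that copies the overwritten value reproduces it, which is the role copying plays in removing the dependence that blocks fusion.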
Because of the increasing gap between the speeds of processors and main memories, compilers must enh...
Our aim is to minimize the electrical energy used during the execution of sign...
Over the past 20 years, increases in processor speed have dramatically outstripped performance incre...
Tiling is a well-known loop transformation to improve temporal locality of nested loops. Current com...
Loop fusion improves data locality and reduces synchronization in data-parallel applications. Howeve...
Loop fusion is a reordering transformation that merges multiple loops into a single loop. It can inc...
This paper describes an algorithm to optimize cache locality in scientific codes on uniprocessor and m...
Modern processors use a memory hierarchy of several levels. Achieving high performance mandates the ef...
The effectiveness of the memory hierarchy is critical for the performance of current processors. The...
In this paper, an efficient algorithm to implement loop partitioning is introduced and evaluated. We...
This thesis investigates compiler algorithms to transform program and data to utilize efficiently th...
Abstract: Loop fusion is recognized as an effective transformation for improving memory hierarchy pe...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
This paper presents hyperblocking, or hypertiling, a novel optimization technique that makes it poss...
Abstract: Exploiting locality of references has become extremely important in realizing the potential...