Abstract: Loop fusion is recognized as an effective transformation for improving memory hierarchy performance. However, unconstrained loop fusion can degrade performance by increasing register pressure and cache conflict misses. In this paper, we present a cache-conscious analytical model for profitable loop fusion. We use this model to tune fusion parameters for different architectures through empirical search. Experiments on four different platforms for a set of applications show significant speedup over fully optimized code generated by state-of-the-art commercial compilers.
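To make the transformation concrete, the following is a minimal sketch in C of fusing two adjacent loops that traverse the same index range. The function names (`scale_then_sum_unfused`, `scale_then_sum_fused`) and the scale-then-sum computation are hypothetical and chosen only for illustration; they do not come from the paper, and the sketch shows only the basic transformation, not the cache-conscious model or the empirical search.

```c
#include <stddef.h>

/* Unfused version: two separate passes over the data.
 * If n is large, b[] may be evicted from cache between the
 * first loop (which writes it) and the second (which reads it). */
void scale_then_sum_unfused(const double *a, double *b, size_t n,
                            double k, double *sum_out)
{
    for (size_t i = 0; i < n; i++)
        b[i] = k * a[i];

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += b[i];
    *sum_out = sum;
}

/* Fused version: a single pass. Each b[i] is consumed immediately
 * after it is produced, while it is still in a register or in cache.
 * Fusion is legal here because iteration i of the second loop depends
 * only on iteration i of the first. */
void scale_then_sum_fused(const double *a, double *b, size_t n,
                          double k, double *sum_out)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        b[i] = k * a[i];
        sum += b[i];
    }
    *sum_out = sum;
}
```

Both functions compute the same result. The fused loop body holds more live values at once, which is the source of the register-pressure cost that the abstract mentions when fusion is applied without constraint.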
Loop fusion is a reordering transformation that merges multiple loops into a single loop. It can inc...
In this lecture we consider loop transformations that can be used for cache optimization. The transf...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/19...
Traditional compilers are limited in their ability to optimize applications for different architectu...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
Loop fusion improves data locality and reduces synchronization in data-parallel applications. Howeve...
Because of the increasing gap between the speeds of processors and main memories, compilers must enh...
Modern processors use a memory hierarchy of several levels. Achieving high performance mandates the ef...
We are facing an increasing performance gap between processor and memory speed on today'...
© 1994 ACM. In the past decade, processor speed has become significantly faster than memory speed. S...
Abstract. In recent years, a number of strategies have emerged for empirically tuning applications ...
The study and understanding of memory hierarchy behavior is essential, as it is critical to current ...
The memory bandwidth largely determines the performance of embedded systems. However, very often com...
We present a novel, compile-time method for determining the cache performance of the loop nests in a...