Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlin...
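As a minimal sketch of the layout function this abstract describes, the two conventional mappings from a 2D index to a flat memory offset can be written as below. The function names and the `n_rows`/`n_cols` parameters are illustrative assumptions, not taken from the paper.

```python
def row_major_index(i, j, n_rows, n_cols):
    """Flat offset of element (i, j) when rows are stored contiguously (C convention)."""
    return i * n_cols + j


def col_major_index(i, j, n_rows, n_cols):
    """Flat offset of element (i, j) when columns are stored contiguously (Fortran convention)."""
    return j * n_rows + i


if __name__ == "__main__":
    # In a 3x4 array, stepping along a row moves by 1 in row-major order
    # but by n_rows in column-major order.
    for j in range(4):
        print(j, row_major_index(1, j, 3, 4), col_major_index(1, j, 3, 4))
```

Either mapping favors one traversal direction: walking along the non-contiguous dimension touches elements a full row or column apart, which is the locality problem the abstract refers to.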
Abstract. Exploiting locality of references has become extremely important in realizing the potential...
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expe...
Submitted for publication to IEEE TPDS. The performance of both serial and parallel implementations o...
The trend in high-performance microprocessor design is toward increasing computational power on the ...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/16...
Hierarchically-blocked non-linear storage layouts, such as the Morton ordering, have been proposed a...
This paper introduces a dynamic layout optimization strategy to minimize the number of cycles spent ...
Two-dimensional arrays are generally arranged in memory in row-major order or column-major order. Tr...
The bandwidth mismatch between processor and main memory is one major limiting problem. Although str...
The importance of tiles or blocks in mathematics and thus computer science cannot be overstated. Fro...
Processor arrays can be used as accelerators for many data flow-dominant applications. The ex...
We present an original approach to automatic array alignment, the step in the hierarchical transform...
This article investigates the recursive Morton ordering of two-dimensional arrays as an efficient wa...
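One common way to compute a Morton (Z-order) offset for a 2D index is to interleave the bits of the row and column numbers. The sketch below assumes that convention, with column bits in the even positions and row bits in the odd positions; the function name and the `bits` parameter are hypothetical and not drawn from the article.

```python
def morton_index(i, j, bits=16):
    """Interleave the low `bits` bits of row i and column j into a Z-order offset."""
    z = 0
    for b in range(bits):
        z |= ((j >> b) & 1) << (2 * b)      # column bit goes to an even position
        z |= ((i >> b) & 1) << (2 * b + 1)  # row bit goes to an odd position
    return z


if __name__ == "__main__":
    # Print the offsets of a 4x4 array; each 2x2 quadrant occupies
    # four consecutive offsets.
    for i in range(4):
        print([morton_index(i, j) for j in range(4)])
```

Because every aligned 2^k x 2^k sub-block maps to a contiguous range of offsets, the layout is blocked recursively at every scale, independent of any particular cache size.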
On modern computers, the performance of programs is often limited by memory latency rather than by p...
Abstract. Morton layout is a compromise storage layout between the programming language mandated lay...