We present a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy. Unlike traditional compiler techniques like tiling, which are based on reasoning about the control flow of programs, our techniques are based on reasoning directly about the flow of data through the memory hierarchy. Our data-centric transformations permit a more direct solution to the problem of enhancing data locality than current control-centric techniques do, and generalize easily to multiple levels of memory hierarchy. We buttress these claims with performance numbers for standard benchmarks from the problem domain of dense numerical linear algebra. The simplicity and intuitive appeal of our approach should make it...
For good performance of every computer program, good cache and TLB utilization is crucial. In numeri...
This thesis describes novel techniques and test implementations for optimizing numerically intensive...
Many applications are memory intensive and thus are bounded by memory latency and bandwidth. While i...
The trend in high-performance microprocessor design is toward increasing computational power on the ...
In order to mitigate the impact of the constantly widening gap between processor speed and main memo...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/16...
Block-recursive codes for dense numerical linear algebra computations appear to be well-suited for ...
On modern computers, the performance of programs is often limited by memory latency rather than by p...
This report has been developed over the work done in the deliverable [Nava94]. There it was shown tha...
Abstract. This paper presents a new unified method for simultaneously tiling the register and cache ...
The gap between CPU speed and memory speed in modern computer systems is widening as new generations...
For many numerical codes the transport of data from main memory to the registers is commonly consid...
This paper presents a technique for finding good distributions of arrays and suitable loop restructu...
The goal of the LAPACK project is to provide efficient and portable software for dense numerical lin...