Blocked algorithms have much better data locality than ordinary algorithms and can therefore be much more efficient when a memory hierarchy is involved. On the other hand, they are very difficult to write and to tune for particular machines. The reorganization of nested loops through known program transformations is considered as a way to create blocked algorithms automatically. The program transformations used are strip mining, loop interchange, and a variant of loop skewing in which invertible linear transformations (with integer coordinates) of the loop indices are allowed. Some problems concerning the optimal application of these transformations are solved. It is shown, in a very general setting, how to choose a...
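As a rough illustration of two of the transformations named above, the sketch below applies strip mining and loop interchange by hand to a matrix multiplication. It is not taken from the paper: the function names, the choice of matrix multiplication as the example, and the tile size `block` are assumptions made only for this example, and loop skewing is not shown.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c[i][j] += a[i][k] * b[k][j]."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, block=2):
    """Same computation after strip mining j and k and interchanging loops.

    `block` is a hypothetical tile size; a real compiler or programmer
    would choose it to fit the target cache.
    """
    c = [[0] * n for _ in range(n)]
    # Strip mining splits j and k into tile loops (jj, kk) plus inner loops.
    # Loop interchange then moves the tile loops outermost.
    for jj in range(0, n, block):
        for kk in range(0, n, block):
            for i in range(n):
                # The inner loops touch only one block-by-block tile of b,
                # which is reused for every value of i.
                for j in range(jj, min(jj + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c
```

Both functions perform the same multiply-add operations in a different order, so with integer inputs they return identical results. Only the memory access pattern changes.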
To compile programs for message passing architectures and to obtain good performance on NUMA archit...
Although computer system architecture and throughput improve continuously, the need for high c...
The trend in high-performance microprocessor design is toward increasing computational power on the ...
This paper describes an algorithm to optimize cache locality in scientific codes on uniprocessor and...
This paper presents a technique for finding good distributions of arrays and suitable loop restructu...
Over the past 20 years, increases in processor speed have dramatically outstripped performance incre...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
Global locality optimization is a technique for improving the cache performance of a sequence of loo...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/18...
Cache memories were inv...
The effectiveness of the memory hierarchy is critical for the performance of current processors. The...
This thesis investigates compiler algorithms to transform program and data to utilize efficiently th...
In this paper, we describe a framework for loop transformations and code generation for NUMA (non-unifo...
Intensive scientific algorithms can usually be formulated as nested loops which are the ...
For good performance of every computer program, good cache utilization is crucial. In num...