To keep up with a large degree of ILP, the Itanium2 L2 cache system uses a complex organization scheme: load/store queues, banking, and interleaving. In this paper, we study the impact of this cache system on memory instruction scheduling. We demonstrate that, for scientific codes, "memory access vectorization" makes it possible to generate very efficient code (up to the maximum of 4 loads per cycle). The impact of such "vectorization" on register pressure is analyzed: various register allocation schemes are proposed and evaluated.
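As a rough, hypothetical sketch (not code from the paper), the transformation below illustrates the idea behind grouping memory accesses: unrolling a loop and hoisting independent loads together hands the instruction scheduler a block of memory operations it can issue back to back, which is what allows filling the Itanium2 L2's issue bandwidth of up to 4 loads per cycle. The DAXPY-style kernel, the function names, and the unroll factor of 4 are illustrative assumptions.

```c
/* Hypothetical sketch (not from the paper): grouping independent loads
 * so the scheduler can issue them together. */
#include <stddef.h>

/* Naive form: one load of x[i], one of y[i], then a multiply-add per iteration. */
void daxpy_naive(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Grouped memory accesses: unroll by 4 and hoist the loads so that the
 * eight independent loads appear before the arithmetic, giving the
 * scheduler full groups of memory operations to issue per cycle. */
void daxpy_grouped(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double x0 = x[i],     x1 = x[i + 1];
        double x2 = x[i + 2], x3 = x[i + 3];
        double y0 = y[i],     y1 = y[i + 1];
        double y2 = y[i + 2], y3 = y[i + 3];
        y[i]     = y0 + a * x0;
        y[i + 1] = y1 + a * x1;
        y[i + 2] = y2 + a * x2;
        y[i + 3] = y3 + a * x3;
    }
    for (; i < n; i++)   /* remainder iterations */
        y[i] += a * x[i];
}
```

The grouped version keeps eight values live at once instead of two, which is exactly the register-pressure trade-off the abstract analyzes through its different register allocation schemes.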
The instruction cache is a popular target for optimizations of microprocessor-based systems because ...
The technological improvements in silicon manufacturing are yielding vast increases of processor &ap...
Simulations of scientific programs running on traditional scientific computer architectures show tha...
To keep up with a large degree of instruction level parallelism (ILP), the Ita...
Memory disambiguation mechanisms, coupled with load/store queues in out-of-ord...
The processor speeds continue to improve at a faster rate than the memory access times. The issue of...
In order to mitigate the impact of the constantly widening gap between processor speed and main memo...
The study and understanding of memory hierarchy behavior is essential, as it is critical to current ...
Obtaining high performance without machine-specific tuning is an important goal of scientific applic...
Exploiting locality of reference is key to realizing high levels of performance on modern p...
This paper presents an experimental study on cache memory designs for vector computers. We use an ex...
The central data structures for many applications in scientific computing are large multidimensional...
In global scheduling for ILP processors, region-enlarging optimizations, especially tail duplication,...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...