This paper explores the interplay between algorithm design and a computer's memory hierarchy. Matrix transpose and bit-reversal reordering are important scientific subroutines that often exhibit severe performance degradation due to cache and TLB associativity problems. We give lower bounds showing that, for typical memory hierarchy designs, extra data movement is unavoidable. We also prescribe the characteristics that the various levels of the memory hierarchy need in order to perform efficient bit-reversals. Insight gained from our analysis leads to the design of a near-optimal bit-reversal algorithm. This Cache Optimal Bit Reverse Algorithm (COBRA) is implemented on the Digital Alpha 21164, Sun Ultrasparc 2, and IBM Power2. We show that COBRA is ne...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/16...
In this work, the performance of basic and Strassen’s matrix multiplication algorithms ar...
Journal Article: Conventional microarchitectures choose a single memory hierarchy design point target...
With the increasing demand for online/inline data processing, efficient Fourier analysis becomes more...
This paper discusses a bit-vector implementation of an algorithm that computes an optimal sequence ...
This brief presents novel circuits for calculating the bit reversal on parallel data. The circuits c...
Many versions of the Fast Fourier Transform require a reordering of either the input or the outp...
We present a model that enables us to analyze the running time of an algorithm on a computer with a ...
The Fast Fourier Transform is incomplete without bit reversal. Novel parallel circuits for ...
The traditional permutation multiplication algorithm is now limited by memory latency and not by CPU...
Journal Article: Although microprocessor performance continues to increase at a rapid pace, the growin...
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FF...
We describe a reversible Instruction Set Architecture using recently developed reversible logic desi...
The trend in high-performance microprocessor design is toward increasing computational power on the ...
A wide variety of Fast Fourier Transform (FFT) algorithms employ a bit reversal method for the reord...