Abstract: Efficient implementation of matrix algebra is important to the performance of many large and complex physical models. One important tuning technique is loop fusion, which can reduce the amount of data moved between memory and the processor. We have developed the Build to Order (BTO) compiler to automate loop fusion for matrix algebra kernels. In this paper, we present BTO’s analytic memory model, which substantially reduces the number of loop fusion options considered by the compiler. We introduce an example that motivates the inclusion of registers in the model. We demonstrate how the model’s modular design facilitates the addition of register allocation to the model’s set of memory components, improving its accuracy.
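To make the data-movement argument concrete, here is a minimal sketch of loop fusion on a pair of matrix-vector products, q = A p and s = A^T r. The kernel choice, the function names `unfused` and `fused`, and the `loads` counter are assumptions made for illustration; they are not taken from the paper and do not represent BTO's output or its memory model. The counter only tallies how often an element of A is read, which is a crude stand-in for memory traffic on a matrix too large to stay in cache.

```python
def unfused(A, p, r):
    """Compute q = A p and s = A^T r with two separate loop nests."""
    m, n = len(A), len(A[0])
    loads = 0  # illustrative counter: reads of elements of A
    q = [0.0] * m
    for i in range(m):
        for j in range(n):
            q[i] += A[i][j] * p[j]
            loads += 1
    s = [0.0] * n
    for i in range(m):
        for j in range(n):
            s[j] += A[i][j] * r[i]
            loads += 1
    return q, s, loads


def fused(A, p, r):
    """Compute the same q and s in a single loop nest over A."""
    m, n = len(A), len(A[0])
    loads = 0  # illustrative counter: reads of elements of A
    q = [0.0] * m
    s = [0.0] * n
    for i in range(m):
        for j in range(n):
            a = A[i][j]  # read once, then reused by both updates
            loads += 1
            q[i] += a * p[j]
            s[j] += a * r[i]
    return q, s, loads


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    p = [1.0, 0.0, -1.0]
    r = [2.0, 1.0]
    print(unfused(A, p, r))
    print(fused(A, p, r))
```

Both versions perform the same arithmetic in the same order, so the results match; the fused version traverses A once instead of twice. Whether fusion pays off on real hardware depends on cache and register capacity, which is the kind of question an analytic memory model is meant to answer.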
Over the past decade, microprocessor design strategies have focused on increasing the computational ...
The final publication is available at Springer via http://dx.doi.org/10.1007/s10766-013-0249-6 The in...
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expe...
On modern processors, data transfer exceeds floating-point operations as the predominant cost in man...
Abstract—Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) t...
It is rare for a programmer to solve a numerical problem with a single library call; most problems r...
The memory bandwidth largely determines the performance of embedded systems. However, very often com...
The goal of the LAPACK project is to provide efficient and portable software for dense numerical lin...
A current trend in high-performance computing is to decompose a large linear algebra problem into ba...
This paper describes an approach for the automatic generation and optimization of numerical softwar...
This thesis describes novel techniques and test implementations for optimizing numerically intensive...
Earth system modeling computations use stencils extensively while running many kernels. Optimal codi...
Abstract: In this work the behavior of the multithreaded implementation of some LAPACK routines on PLA...
Over the past 20 years, increases in processor speed have dramatically outstripped performance incre...