On modern processors, data transfer exceeds floating-point operations as the predominant cost in many linear algebra computations. For these memory-bound calculations, reducing data movement is often the only way to significantly increase their speed. One tuning technique that focuses on reducing memory accesses is loop fusion. However, determining the optimal amount of loop fusion to apply to a routine is difficult, as fusion can affect memory traffic both positively and negatively. In this thesis, we perform an in-depth analysis of how loop fusion affects data movement throughout the memory hierarchy. The results of this analysis are used to create a memory model for fused linear algebra calculations. The model predicts data movement throu...
Over the past decade, microprocessor design strategies have focused on increasing the computational ...
While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth h...
As the demand increases for high performance and power efficiency in modern computer runtime systems...
Abstract: Efficient implementation of matrix algebra is important to the performance of many large and...
The memory bandwidth largely determines the performance of embedded systems. However, very often com...
The Tensor Contraction Engine (TCE) is a compiler that translates high-level, mathematical tensor co...
Abstract: Loop fusion is recognized as an effective transformation for improving memory hierarchy pe...
Over the past 20 years, increases in processor speed have dramatically outstripped performance incre...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/18...
The recent dramatic progress in machine learning is partially attributed to the availability of high...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/19...
The benefits of a high-level approach to parallel programming are well understood and are often desire...