In this paper, we describe a framework for loop transformations and code generation for NUMA (non-uniform memory access) machines. Most scalable parallel machines can be classified as NUMA machines because a processor can access data in its local memory ten to a thousand times faster than it can access non-local data. In addition, when a processor must make a number of accesses to data residing at a remote processor, it is usually more efficient to use block transfers of data rather than to use many small messages. Furthermore, each processor usually has a data cache. A system for programming these machines must tackle the following challenges: (1) expose and exploit parallelism in programs, (2) manage data to avoid making non-local accesses, ...
Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multip...
Current high-performance multicore processors provide users with a non-uniform memory access model (...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
A common feature of many scalable parallel machines is non-uniform memory access (NUMA) --- data acc...
A common feature of many scalable parallel machines is non-uniform memory access - a processor can ...
In this paper, we discuss a loop transformation framework that is based on integer non-singular mat...
In this paper, we discuss a loop transformation framework that is based on integer non-singular ma...
Many applications are memory intensive and thus are bounded by memory latency and bandwidth. While i...
The paper extends the framework of linear loop transformations adding a new nonlinear step at the tr...
In this tutorial, we address the problem of restructuring a (possibly sequential) program to improve...
This paper presents a technique for finding good distributions of arrays and suitable loop restructu...
Loop transformations are becoming critical to exploiting parallelism and data locality in paralleli...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/18...
Over the past 20 years, increases in processor speed have dramatically outstripped performance incre...
In this paper we generalize the framework of linear loop transformations: we consider loop alignment...