Typical parallelization approaches such as OpenMP and CUDA provide constructs for parallelizing individual loops and blocking them for data locality. Because they treat each loop separately, these approaches cannot exploit the data locality made possible by inter-loop data reuse. The loop chain abstraction provides a framework for reasoning about and applying inter-loop optimizations. In this work, we incorporate the loop chain abstraction into RAJA, a performance portability layer for high-performance computing applications. Using the loop-chain-extended RAJA, or RAJALC, developers can have the RAJA library apply loop transformations such as loop fusion and overlapped tiling while maintaining the original structure of their programs. By in...
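To make the transformation concrete, here is a minimal sketch of loop fusion in plain C++; it is illustrative only and does not use the RAJALC interface, which the abstract does not detail. Two loops traversing the same arrays are merged so the value produced for b[i] is still in cache (or a register) when the second statement consumes it. The array names and the producer/consumer computation are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Before fusion: by the time the second loop reads b[i], the early
// elements it needs have likely already been evicted from cache.
void unfused(const std::vector<double>& a, std::vector<double>& b,
             std::vector<double>& c) {
    for (std::size_t i = 0; i < a.size(); ++i)
        b[i] = 2.0 * a[i];
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = b[i] + a[i];
}

// After fusion: each b[i] is produced and consumed in the same iteration,
// so the reuse is served from cache (or a register) instead of memory.
void fused(const std::vector<double>& a, std::vector<double>& b,
           std::vector<double>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        b[i] = 2.0 * a[i];
        c[i] = b[i] + a[i];
    }
}
```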
In order to reduce remote memory accesses on CC-NUMA multiprocessors, we present an interprocedural ...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/16...
In recent years, methods for analyzing and parallelizing sequential code using data analysis and loo...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/18...
Over the past 20 years, increases in processor speed have dramatically outstripped performance incre...
In this tutorial, we address the problem of restructuring a (possibly sequential) program to improve...
Loop fusion is a program transformation that merges multiple loops into one. It is effectiv...
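A frequently cited companion benefit of fusion, once producer and consumer share one loop body, is that the intermediate array can be contracted to a scalar; whether this particular paper covers contraction is not visible in the truncated abstract, so the sketch below is a generic C++ illustration, reusing the assumed names from the fusion example above.

```cpp
#include <cstddef>
#include <vector>

// With producer and consumer in one loop body, each b[i] dies in the
// iteration that creates it, so the whole intermediate array contracts
// to a single scalar temporary.
void fused_contracted(const std::vector<double>& a, std::vector<double>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double b = 2.0 * a[i];  // former array element b[i]
        c[i] = b + a[i];
    }
}
```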
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
Exposing opportunities for parallelization while explicitly managing data locality is the primary ch...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
Parallel processing has been used to increase performance of computing systems for the past several ...
We present ALPyNA, an automatic loop parallelization framework for Python, which analyzes data depen...
Parallelizing compilers promise to exploit the parallelism available in a given program, particularl...
This dissertation proposes and evaluates compiler techniques...
Many scientific applications are organized in a data parallel way: as sequences of parallel...
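For chains of parallel loops like those just described, overlapped tiling is one way to keep tiles independent across stages: each tile redundantly recomputes a small halo of the intermediate result so no inter-tile synchronization is needed between sweeps. The following C++ sketch of a two-stage, three-point stencil chain is a generic illustration under assumed names and tile size, not code from the cited paper; boundary elements of out are left unwritten for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Two chained three-point sweeps: tmp averages in's neighbors, then out
// averages tmp's neighbors. Each tile computes tmp for its interior plus a
// one-point halo on each side, so both stages run back to back per tile and
// tiles never wait on one another; the halo points are recomputed
// redundantly by adjacent tiles. Names and the tile size are illustrative.
void overlapped_tiled(const std::vector<double>& in, std::vector<double>& out,
                      std::size_t tile = 256) {
    const std::size_t n = in.size();
    for (std::size_t lo = 1; lo + 1 < n; lo += tile) {      // tiles are independent
        const std::size_t hi = std::min(lo + tile, n - 1);  // interior [lo, hi)
        const std::size_t tlo = lo - 1, thi = hi + 1;       // with halo: [tlo, thi)
        std::vector<double> tmp(thi - tlo);
        // Stage 1 over the tile plus halo; endpoints just copy the input.
        for (std::size_t i = tlo; i < thi; ++i)
            tmp[i - tlo] = (i == 0 || i + 1 == n)
                               ? in[i]
                               : (in[i - 1] + in[i] + in[i + 1]) / 3.0;
        // Stage 2 over the interior only, reading the halo from this tile's tmp.
        for (std::size_t i = lo; i < hi; ++i)
            out[i] = (tmp[i - 1 - tlo] + tmp[i - tlo] + tmp[i + 1 - tlo]) / 3.0;
    }
}
```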