Presentation given at EGU 2020 (doi.org/10.5194/egusphere-egu2020-9732). On the development roadmap of modern parallel architectures, the computing power of a node grows much faster than main memory performance (capacity, bandwidth), so the gap between computing and memory resources keeps widening. Efficient use of cache memory is therefore becoming ever more essential as an optimization technique. The NEMO model uses a finite-difference integration method and a regular Cartesian grid for space discretization. The NEMO code reflects this choice: a generic field is represented in memory as a 3D array, and the code is mainly composed of three-level nested loops. These loops often include only a few operations in the body; the resu...
We present the internal representation and optimizations used by the CASH compiler for improving the...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/16...
A common feature of many scalable parallel machines is non-uniform memory access (NUMA) --- data acc...
In the past decade, processor speed has become significantly faster than memory speed. Small, fast c...
Grantor: University of Toronto. This dissertation proposes and evaluates compiler techniques...
This thesis investigates compiler algorithms to transform program and data to utilize efficiently th...
In 2019 a non-intrusive instrumentation of the NEMO code, aimed to give information about the MPI co...
The memory bandwidth largely determines the performance of embedded systems. However, very often com...
Loop fusion is a reordering transformation that merges multiple loops into a single loop. It can inc...
Earth system modeling computations use stencils extensively while running many kernels. Optimal codi...
Over the past decade, microprocessor design strategies have focused on increasing the computational ...
© 1994 ACM. In the past decade, processor speed has become significantly faster than memory speed. S...
Abstract: Loop fusion is recognized as an effective transformation for improving memory hierarchy pe...
Modern processors use a memory hierarchy of several levels. Achieving high performance mandates the ef...
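Several of the abstracts above describe loop fusion as merging multiple loops into a single loop. A minimal sketch of that idea follows, assuming two simple element-wise loops over 1D lists; the function names `unfused` and `fused` and the arithmetic in the loop bodies are hypothetical and are not taken from NEMO or from any of the cited works. Python is used only for brevity: the cache benefit the abstracts discuss applies to compiled array codes, and this sketch only shows that the transformation preserves the result while removing the intermediate array.

```python
def unfused(a, b):
    """Two separate loops: the temporary array is written, then read again."""
    n = len(a)
    tmp = [0.0] * n
    out = [0.0] * n
    for i in range(n):          # first loop: produce the intermediate field
        tmp[i] = a[i] + b[i]
    for i in range(n):          # second loop: consume the intermediate field
        out[i] = 2.0 * tmp[i]
    return out


def fused(a, b):
    """One fused loop: the intermediate value stays in a scalar."""
    n = len(a)
    out = [0.0] * n
    for i in range(n):
        t = a[i] + b[i]         # replaces tmp[i]; no temporary array needed
        out[i] = 2.0 * t
    return out


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
same = unfused(a, b) == fused(a, b)
```

Fusion is legal here because iteration `i` of the second loop depends only on iteration `i` of the first; loops with dependences across iterations need a dependence check before they can be merged.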