We apply auto-tuning to a hybrid MPI/pthreads lattice Boltzmann computation running on the Cray XT4 at the National Energy Research Scientific Computing Center (NERSC). Previous work showed that multicore-specific auto-tuning can improve the performance of lattice Boltzmann magnetohydrodynamics (LBMHD) by a factor of 4 when running on dual- and quad-core Opteron dual-socket SMPs. We extend these studies to the distributed-memory arena via a hybrid MPI/pthreads implementation. In addition to conventional auto-tuning at the local SMP node, we tune at the message-passing level to determine the optimal aspect ratio as well as the correct balance between MPI tasks and threads per MPI task. Our study presents a detailed performance analysis when mov...
Energy consumption is a major concern with high performance multicore systems. In this paper, we exp...
In this paper we address the problem of identifying and exploiting techniques that optimize the perf...
We describe the implementation and optimization of a state-of-the-art Lattice Boltzmann code for com...
We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and c...
We present an auto-tuning approach to optimize application performance on emerging multicore archite...
GPUs deliver higher performance than traditional processors, offering remarkable energy efficiency, ...
Heterogeneous clusters are a widely utilized class of supercomputers assembled from different types ...
In this paper we report on our early experience on porting, optimizing and benchmarking a Lattice Bo...
The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors to bui...
Hybrid parallel programming models combining distributed and shared memory paradigms are we...
The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient i...