We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to Sparse Matrix Vector Multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for eac...
We apply auto-tuning to a hybrid MPI-pthreads lattice Boltzmann computation running on the Cray XT4 ...
This dissertation presents an architecture to accelerate sparse matrix linear algebra, which is among...
We analyze the efficiency of servers equipped with state-of-the-art general-purpose multicore proces...
Understanding the most efficient design and utilization of emerging multicore systems is one of the ...
We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as...
In high-performance computing, excellent node-level performance is required for the efficient use of...
The recent transformation from an environment where gains in computational performance came from inc...
Technology scaling trends have enabled the exponential growth of computing power. However, the perfo...
As power has become the pre-eminent design constraint for future HPC systems, computational efficien...
We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and c...
Dense linear algebra (DLA) is one of the seven most important kernels in high-performance computing. ...