Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our recent work showed scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a campaign of performance tuning and scalability studies using multi-core CPUs on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in 4× speed-u...
The Fast Multipole Method (FMM) is well known to possess a bottleneck arising from decreasing worklo...
This paper presents an optimized CPU–GPU hybrid implementation and a GPU performance model for the ...
In the last two decades, physical constraints in chip design have spawned a paradigm shift in comput...
A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems Rio Yokota...
<b>Invited Lecture at the SIAM <i>"Encuentro Nacional de Ingeniería Matemática,"</i> at Pontificia U...
This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for curr...
The Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered at ...
We present efficient algorithms to build data structures and the lists needed for fast multipole met...
This work presents the first extensive study of single-node performance optimization, tuning, and a...
Fast summation methods like the FMM are the backbone of a multitude of simulations in MD, astrophysi...
Poster featured at the NVIDIA exhibit booth in the Supercomputing Conference, November 2011, Seattle...
We present parallel versions of a representative N-body application that uses Greengard and Rokhlin&...
Learn about the fast multipole method (FMM) and its optimization on NVIDIA GPU...
The Fast Multipole Method (FMM) is considered one of the top ten algorithms of the 20th ...