This paper presents an optimized CPU–GPU hybrid implementation and a GPU performance model for the kernel-independent fast multipole method (FMM). We implement an optimized kernel-independent FMM for GPUs, and combine it with our previous CPU implementation to create a hybrid CPU+GPU FMM kernel. When compared to another highly optimized GPU implementation, our implementation achieves as much as a 1.9× speedup. We then extend our previous lower-bound analyses of FMM for CPUs to include GPUs. This yields a model for predicting the execution times of the different phases of FMM. Using this information, we estimate the execution times of a set of static hybrid schedules on a given system, which allows us to automatically choose the schedule...
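As a rough illustration of the model-driven scheduling idea described above (not the paper's actual implementation), the sketch below enumerates candidate static assignments of the FMM phases to CPU or GPU, sums the per-phase times predicted by a simple linear cost model, and picks the assignment with the smallest estimate. The phase list and the per-point cost numbers are hypothetical placeholders standing in for the paper's performance model.

```cpp
// Hedged sketch of model-driven static scheduling for a hybrid CPU+GPU FMM.
// Everything here (phase list, cost numbers) is an illustrative placeholder,
// not the paper's actual performance model.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>
#include <utility>
#include <vector>

struct Phase {
  std::string name;
  double cpu_sec_per_point;  // hypothetical model: time grows linearly in N
  double gpu_sec_per_point;
};

// Major FMM phases that a static schedule assigns to one device each; the
// per-point costs are made-up numbers standing in for the model's output.
const std::vector<Phase> kPhases = {
    {"U-list (direct P2P)", 4e-7, 1e-7},
    {"upward (P2M/M2M)",    1e-7, 2e-7},
    {"V-list (M2L)",        3e-7, 1e-7},
    {"downward (L2L/L2P)",  1e-7, 2e-7},
};

// Enumerate all 2^(#phases) static CPU/GPU assignments, estimate each one's
// total time with the model, and return the cheapest assignment. Phases are
// assumed to run back-to-back; modeling CPU/GPU overlap would refine this.
std::pair<std::uint32_t, double> pick_schedule(std::size_t num_points) {
  std::uint32_t best_mask = 0;
  double best_time = std::numeric_limits<double>::infinity();
  for (std::uint32_t mask = 0; mask < (1u << kPhases.size()); ++mask) {
    double t = 0.0;
    for (std::size_t p = 0; p < kPhases.size(); ++p) {
      const bool on_gpu = (mask >> p) & 1u;  // bit p == 1 => phase p on GPU
      const Phase& ph = kPhases[p];
      t += num_points * (on_gpu ? ph.gpu_sec_per_point : ph.cpu_sec_per_point);
    }
    if (t < best_time) { best_time = t; best_mask = mask; }
  }
  return {best_mask, best_time};
}

int main() {
  auto [mask, secs] = pick_schedule(1'000'000);
  for (std::size_t p = 0; p < kPhases.size(); ++p)
    std::cout << kPhases[p].name << " -> "
              << (((mask >> p) & 1u) ? "GPU" : "CPU") << "\n";
  std::cout << "estimated time: " << secs << " s\n";
}
```

In this toy model the U-list and V-list phases land on the GPU and the tree traversal phases stay on the CPU; with a real per-phase model the same exhaustive search over the handful of phases remains cheap, which is what makes a static, automatically chosen schedule practical.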