This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25× on Intel's quad-core Nehalem, 9.4× on AMD's quad-core Barcelona, and 37.6× on Sun's Victoria Falls (dual-sockets on all systems). We also compare our single-precision version against our prior state-of-the-art GPU-based code and show, su...
In the last two decades, physical constraints in chip design have spawned a paradigm shift in comput...
This thesis presents a top to bottom analysis on designing and implementing fast algorithms for curr...
Understanding the most efficient design and utilization of emerging multicore systems is one of the ...
Among the algorithms that are likely to play a major role in future exascale computing, the fast mul...
The Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered at ...
Learn about the fast multipole method (FMM) and its optimization on NVIDIA GPU...
We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (F...
A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems Rio Yokota...
This paper presents an optimized CPU–GPU hybrid implementation and a GPU performance model for the ...
The Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered at ...
We present efficient algorithms to build data structures and the lists needed for fast multipole met...
Invited Lecture at the SIAM "Encuentro Nacional de Ingeniería Matemática," at Pontificia U...
We present parallel versions of a representative N-body application that uses Greengard and Rokhlin&...
It has been shown that fast multipole methods can achieve good scalability on multi-core architectur...
Abstract. We discuss an implementation of adaptive fast multipole methods targeting hybrid multicor...