Solving an N-body problem, electrostatic or gravitational, is a crucial task and the main computational bottleneck in many scientific applications. Its direct solution is a ubiquitous showcase example for the compute power of graphics processing units (GPUs). However, the naive pairwise summation has $O(N^2)$ computational complexity. The fast multipole method (FMM) can reduce runtime and complexity to $O(N)$ for any specified precision. Here, we present a CUDA-accelerated C++ FMM implementation for multi-particle systems with $r^{-1}$ potential that are found, e.g., in biomolecular simulations. The algorithm involves several operators to exchange information in an octree data structure. We focus on the Multipole-to-Local (M2L) operator, as its runtime is limiti...