The efficient parallelization of fast multipole-based algorithms for the N-body problem is one of the most challenging topics in high performance scientific computing. The emergence of non-local, irregular communication patterns generated by these algorithms can easily create an insurmountable bottleneck on supercomputers with hundreds of thousands of cores. To overcome this obstacle we have developed an innovative parallelization strategy for Barnes-Hut tree codes on present and upcoming HPC multicore architectures. This scheme, based on a combined MPI-Pthreads approach, permits an efficient overlap of computation and data exchange. We highlight the capabilities of this method on the full IBM Blue Gene/P system JUGENE at inch Supercomputin...