The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrödinger equation. A major challenge in implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, we use a hybrid MPI+OpenMP programming model within each node in lieu of the traditional flat MPI programming model. In this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality, which is essential to ensuring that applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threads on the 64-core Intel Knights Landing (KNL) manycore processor and 24 threads on ...
Efficiently programming shared-memory machines is a difficult challenge becaus...
Modern CMPs are designed to exploit both instruction-level parallelism within processors and threadl...
We introduce explicit multi-threading (XMT), a decentralized architecture that exploits fine-grained...
Holistic tuning and optimization of hybrid MPI and OpenMP applications is becoming a focus for paralle...
To amortize the cost of MPI collective operations, nonblocking collectives hav...
To amortize the cost of MPI collective operations, non-blocking collectives ha...
With the increasing prominence of many-core architectures and decreasing per-core resource...
Locality of computation is key to obtaining high performance on a broad variety of parallel architec...
Task parallelism as employed by the OpenMP task construct or some Intel Threading Building Blocks (T...
This paper presents some techniques for efficient thread forking and joining in parallel execution e...
Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threa...
Currently, most scientific applications based on MPI adopt a compute-centric architecture. Needed da...
At the level of multi-core processors that share the same cache, data sharing among threads which be...
Multicore multiprocessors use Non-Uniform Memory Architecture (NUMA) to improve their scalability. ...