This paper addresses performance portability of MPI code on multiprogrammed shared-memory machines. Conventional MPI implementations map each MPI node to an OS process, which suffers severe performance degradation in multiprogrammed environments. Our previous work (TMPI) developed compile- and run-time techniques to support threaded MPI execution by mapping each MPI node to a kernel thread. However, kernel threads incur higher context-switch costs than user-level threads, which lengthens the spinning time required during MPI synchronization. This paper presents an adaptive two-level thread scheme for MPI that reduces context-switch and synchronization costs. The scheme also exposes thread scheduling information at user level, which allows ...
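The core of such a two-level scheme is adaptive waiting: spin briefly in user space before paying for a kernel-level context switch. Below is a minimal C sketch of that spin-then-yield idea under stated assumptions; `wait_for_flag` and `SPIN_LIMIT` are illustrative names, not TMPI's actual interface.

```c
/* Sketch of adaptive spin-then-yield waiting, in the spirit of a
 * two-level scheme: spin briefly (cheap if the partner runs on
 * another core), then yield the kernel thread so other MPI nodes
 * can make progress. SPIN_LIMIT is a hypothetical tuning knob. */
#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 1000  /* how long to busy-wait before yielding */

static void wait_for_flag(atomic_int *flag)
{
    int spins = 0;
    while (!atomic_load_explicit(flag, memory_order_acquire)) {
        if (++spins >= SPIN_LIMIT)
            sched_yield();   /* second level: give up the core */
        /* else: first level, pure busy-wait, no context switch */
    }
}
```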
To amortize the cost of MPI collective operations, non-blocking collectives ha...
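Non-blocking collectives let a rank start a collective, do unrelated work, and complete the operation only when the result is needed. A minimal MPI-3 sketch of that overlap, where `compute_independent_work` is a hypothetical stand-in for any work not touching the reduction buffers:

```c
#include <mpi.h>

void compute_independent_work(void);  /* hypothetical overlapped work */

void overlap_example(double *local, double *global, int n, MPI_Comm comm)
{
    MPI_Request req;

    /* Start the collective; it proceeds in the background. */
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

    compute_independent_work();

    /* Block only when the reduced result is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```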
As high-end computing systems continue to grow in scale, recent advances in multi- and many-core arc...
We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its perfo...
MPI is a message-passing standard widely used for developing high-performance parallel applications....
MPI-based explicitly parallel programs have been widely used for developing high-performance applicat...
The new generation of parallel applications is complex, involving simulation of dynamically varying s...
Threading support for Message Passing Interface (MPI) has been defined in the MPI standard for more ...
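The standard's threading support is negotiated at startup via MPI_Init_thread: the program requests a thread level and the library reports what it actually grants. A minimal sketch of that handshake:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request MPI_THREAD_MULTIPLE: any thread may call MPI at any time. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* The constants are ordered, so a simple comparison suffices. */
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "library granted a weaker level: %d\n", provided);

    /* ... application code, constrained by the granted level ... */

    MPI_Finalize();
    return 0;
}
```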
With the increasing prominence of many-core architectures and decreasing per-core resource...
Many-core architectures, such as the Intel Xeon Phi, provide dozens of cores and hundreds of hardwar...
Proceedings of: First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014...
Hybrid MPI+Threads programming has emerged as an alternative model to the “MPI everywhere” model to...
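In the hybrid model, one rank hosts many threads that may all call MPI, provided the library granted MPI_THREAD_MULTIPLE. A short MPI+OpenMP sketch; using the thread id as the message tag is an illustrative choice, not something the standard prescribes:

```c
#include <mpi.h>
#include <omp.h>

/* Each OpenMP thread of this rank exchanges one value with the
 * matching thread on rank `peer`. Requires MPI_THREAD_MULTIPLE. */
void exchange(int peer, MPI_Comm comm)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double buf = (double)tid;

        /* Distinct tags keep the concurrent per-thread messages apart. */
        MPI_Sendrecv_replace(&buf, 1, MPI_DOUBLE, peer, tid,
                             peer, tid, comm, MPI_STATUS_IGNORE);
    }
}
```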
Non-blocking collectives have been proposed so as to allow communications to b...
To make the most effective use of parallel machines that are being built out of increasing...
Thread-level parallelism in applications is commonly exploited using multithreaded processors. In suc...
Supercomputing applications rely on strong scaling to achieve faster results on a larger number of p...
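Under strong scaling the total problem size stays fixed while the processor count p grows, so the achievable speedup is bounded by the serial fraction s of the work (Amdahl's law), as the standard bound below shows:

```latex
% Strong scaling: fixed total problem size, growing processor count p.
% With serial fraction s, Amdahl's law bounds the achievable speedup:
\[
  S(p) = \frac{T(1)}{T(p)} \le \frac{1}{s + \frac{1-s}{p}}
  \xrightarrow{\;p \to \infty\;} \frac{1}{s}
\]
```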