For collective communication routines to achieve high performance on different platforms, they must adapt to the system architecture and use different algorithms in different situations. Current Message Passing Interface (MPI) implementations, such as MPICH and LAM/MPI, do not fully adapt to the system architecture and fail to achieve high performance on many platforms. In this paper, we present a system that produces efficient MPI collective communication routines. By automatically generating topology-specific routines and using an empirical approach to select the best implementations, our system adapts to a given platform and constructs routines customized for that platform. The experimental resul...
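The empirical selection idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's system: the names `measure`, `select_best`, and the stand-in candidates are hypothetical, and plain Python callables take the place of real MPI collective routines. The sketch only shows the general pattern of timing each candidate at each message size and recording the fastest one.

```python
import time
from typing import Callable, Dict, Iterable

# A candidate is any callable that performs the operation for a given
# message size (in bytes). In a real setting this would wrap an MPI
# collective implementation; here it is a hypothetical stand-in.
Candidate = Callable[[int], None]


def measure(fn: Candidate, msg_size: int, repeats: int = 5) -> float:
    """Return the best wall-clock time (seconds) of fn(msg_size) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(msg_size)
        best = min(best, time.perf_counter() - start)
    return best


def select_best(
    candidates: Dict[str, Candidate],
    msg_sizes: Iterable[int],
    measure_fn: Callable[[Candidate, int], float] = measure,
) -> Dict[int, str]:
    """Build a table mapping each message size to the name of the fastest candidate."""
    table: Dict[int, str] = {}
    for size in msg_sizes:
        timings = {name: measure_fn(fn, size) for name, fn in candidates.items()}
        table[size] = min(timings, key=timings.get)
    return table


if __name__ == "__main__":
    # Hypothetical stand-ins: two ways of "moving" a buffer of msg_size bytes.
    def copy_whole(msg_size: int) -> None:
        bytes(bytearray(msg_size))

    def copy_chunked(msg_size: int, chunk: int = 1024) -> None:
        buf = bytearray(msg_size)
        for offset in range(0, msg_size, chunk):
            bytes(buf[offset:offset + chunk])

    table = select_best(
        {"whole": copy_whole, "chunked": copy_chunked},
        [1024, 65536],
    )
    for size, name in table.items():
        print(f"{size} bytes -> {name}")
```

The `measure_fn` parameter is a design choice for this sketch: it lets the timing method be swapped out, for example for a cost model or a measurement taken across all processes, without changing the selection logic.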
Compiled communication has recently been proposed to improve communication performance for clusters ...
The Message Passing Interface standard, released in April 1994 by the MPI Forum [2], defines a set of...
We develop a message scheduling scheme that can theoretically achieve maximum throughput for all-t...
Collective communication is an important subset of the Message Passing Interface. Improving the perform...
Many parallel applications from scientific computing use collective MPI communication oper-...
Parallel computing on clusters of workstations and personal computers has very high potential, sinc...
Parallel computing on clusters of workstations and personal computers has very high potential, since...
This work presents and evaluates algorithms for MPI collective communication operations on high perf...
Previous studies of application usage show that the performance of collective communications is cr...
Compiled communication has recently been proposed to improve communication performance for clusters ...
We have implemented eight of the MPI collective routines using MPI point-to-point communication rou...
In this paper we investigate a tunable MPI collective communications library on a cluster of SMPs. M...
The Message Passing Interface (MPI) is a standard in parallel computing, and can also be used as a h...
The performance of collective communication operations is one of the deciding factors in the overa...
We give an overview of the algorithms and implementations in the high-performance MPI libraries MPI/...