This paper presents an architecture-independent method for performing BMMC permutations on multiprocessors with distributed memory. All interprocessor communication uses the MPI function MPI_Sendrecv_replace(). The number of elements and number of processors must be powers of 2, with at least one element per processor, and there is no inherent upper bound on the ratio of elements per processor. Our method transmits only data without transmitting any source or target indices, which conserves network bandwidth. When data is transmitted, the source and target processors implicitly agree on each other's identity and the indices of the elements being transmitted. A C-callable implementation of our method is available from Netlib. The implemen...
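To illustrate why no indices need to be transmitted, the following minimal sketch computes BMMC (bit-matrix-multiply/complement) target indices, where an n-bit source index x maps to y = Ax XOR c over GF(2) for a nonsingular bit matrix A. It is not the paper's implementation: the function names `bmmc_target` and `owner`, the row-bitmask encoding of A, and the layout in which the high-order index bits select the processor are all assumptions made for this example. Since every processor can evaluate the same mapping locally, both sides of an exchange can derive the partner rank and element positions on their own.

```python
def bmmc_target(x, rows, c):
    """Return y = A x XOR c over GF(2).

    rows[i] is an integer bitmask for row i of A (bit j of rows[i] is A[i][j]).
    This encoding is an assumption made for this sketch.
    """
    y = 0
    for i, row in enumerate(rows):
        # Bit i of A x is the parity of the bits shared by row i and x.
        parity = bin(row & x).count("1") & 1
        y |= parity << i
    return y ^ c


def owner(index, n_bits, p_bits):
    """Split a global index into (processor rank, local offset).

    Assumed layout: the high p_bits of the index name the processor,
    and the remaining low bits are the offset within that processor.
    """
    local_bits = n_bits - p_bits
    return index >> local_bits, index & ((1 << local_bits) - 1)


if __name__ == "__main__":
    n_bits, p_bits = 3, 1                                 # 8 elements on 2 processors
    rows = [1 << (n_bits - 1 - i) for i in range(n_bits)]  # bit-reversal matrix
    for x in range(1 << n_bits):
        y = bmmc_target(x, rows, 0)
        print(x, owner(x, n_bits, p_bits), "->", y, owner(y, n_bits, p_bits))
```

Bit reversal is used here only because it is a familiar member of the BMMC class; any nonsingular bit matrix and complement vector could be substituted.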
This paper presents communication-efficient algorithms for the external data redistribution problem....
Many parallel applications from scientific computing use MPI collective communication operations to ...
We present a new fast and scalable matrix multiplication algorithm, called DIMMA (Distribution-Indep...
The authors implemented and measured several methods to perform BMMC permutations on the MasPar MP-2...
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O opera...
This report considers the problem of writing data distribution independent (DDI) programs in order t...
Increasingly, modern computing problems, including many scientific and business applications, requir...
This talk discusses optimized collective algorithms and the benefits of leveraging independent hardw...
In the exascale computing era, applications are executed at larger scale than ever before, which results ...
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processor...
BCS MPI proposes a new approach to design the communication libraries for large scale parallel machi...
We give asymptotically equal lower and upper bounds for the number of parallel I/O operations requir...
An algorithm for parallel generation of a random permutation of a large set of distinct integers is ...
MPI is widely used for programming large HPC clusters. MPI also includes persistent operations, whic...
The Message Passing Interface (MPI) is a widely used standard for inter-processor communications in ...