Abstract. An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures running in a hybrid OpenMP/MPI configuration is presented. Significant boosts in speed are observed relative to the distributed transpose used in the state-of-the-art adaptive FFTW library. In some cases, a hybrid configuration allows one to reduce communication costs by reducing the number of MPI nodes, and thereby increasing message sizes. This also allows for a more slab-like than pencil-like domain decomposition for multidimensional Fast Fourier Transforms, reducing the cost of, or even eliminating the need for, a second distributed transpose. Nonblocking all-to-all transfers enable user computation and communication to be ove...
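The link between fewer MPI processes and larger messages can be illustrated with a back-of-envelope model. The sketch below is not the algorithm of the paper. It assumes an N x N matrix of 16-byte elements (complex double precision, an assumption) distributed by rows over P MPI processes, with P dividing N, so that an all-to-all transpose exchanges one (N/P) x (N/P) block between each pair of processes. The function names and the example sizes are hypothetical.

```python
def alltoall_message_bytes(n, mpi_procs, elem_bytes=16):
    """Bytes in each pairwise block of an all-to-all transpose.

    Assumes an n x n matrix distributed by rows over mpi_procs
    processes, with mpi_procs dividing n (an assumption of this sketch).
    """
    if n % mpi_procs != 0:
        raise ValueError("this sketch assumes mpi_procs divides n")
    rows_per_proc = n // mpi_procs
    # Each process sends one (n/P) x (n/P) block to every other process.
    return rows_per_proc * rows_per_proc * elem_bytes


def hybrid_layout(n, cores, threads_per_proc, elem_bytes=16):
    """Return (mpi_procs, message_bytes) for a fixed total core count.

    Using more OpenMP threads per MPI process means fewer MPI
    processes on the same number of cores.
    """
    if cores % threads_per_proc != 0:
        raise ValueError("this sketch assumes threads_per_proc divides cores")
    mpi_procs = cores // threads_per_proc
    return mpi_procs, alltoall_message_bytes(n, mpi_procs, elem_bytes)


if __name__ == "__main__":
    # Hypothetical example: 4096 x 4096 matrix on 64 cores.
    for threads in (1, 2, 4, 8):
        procs, size = hybrid_layout(4096, 64, threads)
        print(f"{threads} threads/process: {procs} MPI processes, "
              f"{size} bytes per message")
```

Under these assumptions, using T threads per process on a fixed number of cores divides the number of MPI processes by T and multiplies the size of each pairwise message by T squared. Whether that reduces the total communication cost depends on the network and on the MPI implementation, which this model does not capture.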
Abstract. The development of the fast Fourier transform (FFT) and its numerous variants in the past 30...
We present a new method for performing global redistributions of multidimensional arrays essential t...
Abstract. In this article, we present a fast algorithm for matrix multiplication optimized for recent ...
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processor...
We consider the problem of matrix transpose on mesh-connected processor networks. On the theoretical...
Data-partition and migration for efficient communication in distributed memory architectures are cri...
Abstract. We present an MPI based software library for computing fast Fourier transforms (FFTs) on m...
We present an MPI-based software library for computing the fast Fourier transforms on massively paral...
The aim of this thesis is the study of different methods to minimize the communication overhead due ...
This paper presents a new and optimal parallel implementation of multidimensional fast Fourier trans...
Fast Fourier transforms parallelize well but need large amounts of communication. An algorithm whi...
Transposing an N × N array that is distributed row- or column-wise across P = N processors is a fund...
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and h...
This paper presents implementations of in‐place algorithms for transposing rectangular matrices. One...
Computing the Fast Fourier Transform on a distributed memory architecture by a direct pipelined radi...