We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m,n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny ...
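For readers unfamiliar with the traditional approach the abstract contrasts against, the following is a minimal sketch of cycle-following transposition for a row-major matrix held in a flat Python list. It is not the decomposition proposed in the paper. The function name `transpose_in_place` and the leader-search strategy (a cycle is rotated only when visited from its smallest index, so that no O(mn) visited-flags array is needed) are illustrative assumptions.

```python
def transpose_in_place(a, m, n):
    """Transpose an m-by-n row-major matrix stored in the flat list `a`, in place.

    Illustrative cycle-following sketch (hypothetical helper, not the paper's
    decomposition). Uses O(1) auxiliary space and pays for that with repeated
    cycle walks to identify each cycle's leader.
    """
    size = m * n
    if len(a) != size:
        raise ValueError("len(a) must equal m * n")
    if size < 2:
        return
    last = size - 1

    def dest(i):
        # The element at row-major index i = r*n + c belongs at c*m + r,
        # which equals (i * m) mod (mn - 1) for every index except the last.
        return i if i == last else (i * m) % last

    for start in range(1, last):
        # Walk the cycle until we return to start or drop below it.
        j = dest(start)
        while j > start:
            j = dest(j)
        if j < start:
            continue  # this cycle was already rotated from a smaller index

        # start is the smallest index in its cycle: rotate the cycle once.
        carried = a[start]
        j = dest(start)
        while j != start:
            a[j], carried = carried, a[j]
            j = dest(j)
        a[start] = carried


matrix = [1, 2, 3,
          4, 5, 6]          # 2 rows, 3 columns
transpose_in_place(matrix, 2, 3)
print(matrix)               # [1, 4, 2, 5, 3, 6], i.e. 3 rows, 2 columns
```

Each cycle depends on a chain of data-dependent index computations, and cycle lengths vary with m and n, which gives some intuition for why the abstract describes this approach as difficult to parallelize.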
The ability to perform permutations of large data sets in place reduces the amount of necessary avai...
Two parallel algorithms are proposed for the solution of the General Linear Model on a SIMD array pr...
The advances in Graphics Processing Unit (GPU) technology and the introduction of CUDA programming ...
This paper presents implementations of in‐place algorithms for transposing rectangular matrices. One...
We develop a prototype library for in-place (dense) matrix storage format conversion between the ca...
Matrix transposition is an important algorithmic building block for many numeric algorithms like m...
This thesis presents a novel algorithm for Transposing Rectangular matrices In-place and in Parallel...
Modern computers keep following the traditional model of addressing memory lin...
Eklundh's (1972) algorithm to transpose a large matrix stored on an external device such as a disc h...
We consider the problem of matrix transpose on mesh-connected processor networks. On the theoretical...
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processor...
Many scientific applications involve operations on sparse matrices. However, due to irregul...
A memory architecture is presented. The memory architecture comprises a first memory and a second me...
An adaptive parallel matrix transpose algorithm optimized for distributed multicore archit...
It is proposed to enhance and simplify the programming of a two dimensional (2-D) torus (and...