A decomposition for in-place matrix transposition

Bryan Catanzaro
Alexander Keller
Michael Garland

Publication date

January 2014

Abstract

We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m,n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny ...
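
For context, the traditional cycle-following approach mentioned in the abstract can be sketched as follows. This is a minimal illustration of that baseline only, not the decomposition the paper proposes. It assumes a row-major m by n matrix stored in a flat Python list; the names `transpose_in_place` and `dest` are hypothetical, chosen for this sketch. With only constant auxiliary space, each cycle is detected by re-walking it to check whether the starting index is its smallest member, which is the source of the extra work beyond O(mn).

```python
def transpose_in_place(a, m, n):
    """Transpose a row-major m x n matrix held in the flat list `a`, in place.

    Baseline cycle-following sketch (hypothetical helper, not the paper's
    method). Uses O(1) auxiliary space; leader detection re-walks cycles.
    """
    size = m * n
    if len(a) != size:
        raise ValueError("len(a) must equal m * n")
    if size < 2:
        return
    last = size - 1

    def dest(i):
        # The element at row-major index i = r*n + c moves to c*m + r,
        # which equals (i * m) mod (m*n - 1); the last index is a fixed point.
        return i if i == last else (i * m) % last

    for start in range(1, last):
        # Walk the cycle until we return to `start` or find a smaller index.
        j = dest(start)
        while j > start:
            j = dest(j)
        if j < start:
            continue  # this cycle was already rotated from a smaller leader

        # `start` is the smallest index in its cycle: rotate the cycle,
        # carrying one displaced element at a time.
        carried = a[start]
        j = dest(start)
        while j != start:
            a[j], carried = carried, a[j]
            j = dest(j)
        a[start] = carried


# Usage: a 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] becomes 3 x 2.
matrix = [1, 2, 3, 4, 5, 6]
transpose_in_place(matrix, 2, 3)
print(matrix)  # [1, 4, 2, 5, 3, 6]
```

Each cycle depends on a sequential chain of index computations, and cycle lengths vary widely, which hints at why this formulation is hard to parallelize.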