Optimal use of the memory system is a key element of fast GPU algorithms. Unfortunately, many common algorithms fail in this regard despite exhibiting great regularity in their memory access patterns. In this paper, we propose efficient kernels to permute the elements of an array. We handle a class of permutations known as Bit Matrix Multiply Complement (BMMC) permutations, for which we design kernels whose speed is comparable to that of a simple array copy. This is a first step towards implementing a set of array combinators based on these permutations.
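To make the class of permutations concrete, here is a minimal sequential sketch in Python of what a BMMC permutation computes. It is not one of the GPU kernels proposed in the paper. It assumes the standard definition: for an array of 2^n elements, the n-bit source index x is sent to A·x XOR c, where A is an invertible n×n bit matrix over GF(2) and c is an n-bit complement vector. The function names, the row-bitmask encoding of A, and the scatter convention (`dst[f(x)] = src[x]`) are illustrative assumptions.

```python
def bmmc_index(rows, c, x):
    """Map index x to (A . x) XOR c over GF(2).

    Assumed encoding: rows[i] is row i of A packed as an integer,
    where bit j of rows[i] holds the matrix entry A[i][j].
    """
    y = 0
    for i, row in enumerate(rows):
        # Bit i of A.x is the parity of the bitwise AND of row i with x.
        bit = bin(row & x).count("1") & 1
        y |= bit << i
    return y ^ c


def bmmc_permute(rows, c, src):
    """Scatter src into a new list: dst[bmmc_index(x)] = src[x].

    Assumes A is invertible, so every destination slot is written once.
    """
    n = len(rows)
    if len(src) != 1 << n:
        raise ValueError("array length must be 2**n for an n x n matrix")
    dst = [None] * len(src)
    for x, value in enumerate(src):
        dst[bmmc_index(rows, c, x)] = value
    return dst


def bit_reversal_rows(n):
    """Anti-diagonal matrix: row i has its single 1 in column n - 1 - i."""
    return [1 << (n - 1 - i) for i in range(n)]


def identity_rows(n):
    """Identity matrix: row i has its single 1 in column i."""
    return [1 << i for i in range(n)]


if __name__ == "__main__":
    data = list(range(8))
    # Bit reversal of 3-bit indices (A anti-diagonal, c = 0).
    print(bmmc_permute(bit_reversal_rows(3), 0, data))
    # Array reversal (A = identity, c = all ones).
    print(bmmc_permute(identity_rows(3), 0b111, data))
```

Bit reversal and array reversal are both instances of the class, which illustrates why a single efficient kernel for BMMC permutations would cover many regular data movements at once.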
Modern high performance processors require memory systems that can provide access to data a...
Many fast algorithms in arithmetic complexity have hierarchical or recursive structures that make ef...
Accessing the memory efficiently to keep up with the data processing rate is a well known problem in...
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O opera...
Due to non-associativity of floating-point operations and dynamic scheduling on parallel architectur...
The objective of high performance computing (HPC) is to ensure that the computational power of hardw...
The FM-index is a data structure which is seeing more and more pervasive use, in particular in the f...
Due to non-associativity of floating-point operations and dynamic schedu...
With the advent of programmer-friendly GPU computing environments, there has been much interest in o...
Permutation-based indexing is one of the most popular techniques for the approximate nearest-neighbo...
Matrix transposition is an important algorithmic building block for many numeric algorithms like m...
Similarity searching is a useful operation for many real applications that work on non-structured or...
The recent advent of high-throughput sequencing machines producing big amounts of short reads has bo...
Increasingly, modern computing problems, including many scientific and business applications, requir...
Given a regular application described by a system of uniform recurrence equations, systolic arrays a...