We present a number of optimization techniques to compute prefix sums on linked lists and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures generally involve highly irregular, fine-grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high-bandwidth memory access, these devices may appear at first to be primarily geared toward streamed, highly data-parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these ...
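The abstract does not spell out the randomization step, so the following is only a minimal sequential sketch of one common splitter-based scheme, assuming that is roughly what is meant: random nodes are chosen as splitters, each splitter's sublist is scanned independently (the part that would map to one GPU thread per sublist), and the sublist totals are then combined. The function name `list_prefix_sums`, the array layout (`succ`, `values`), and the parameter `num_splitters` are hypothetical and not taken from the paper. The sketch also assumes a single list in which every node is reachable from `head`.

```python
import random


def list_prefix_sums(succ, values, head, num_splitters, seed=0):
    """Prefix sums over a linked list stored in arrays.

    succ[i] is the index of the node that follows node i, or -1 at the tail.
    Returns out, where out[i] is the sum of values from the head through node i.
    """
    n = len(succ)
    rng = random.Random(seed)

    # Step 1: choose random splitters. The head is always a splitter so that
    # every node belongs to exactly one sublist.
    others = [i for i in range(n) if i != head]
    k = min(num_splitters, len(others))
    splitters = [head] + rng.sample(others, k)
    is_splitter = [False] * n
    for s in splitters:
        is_splitter[s] = True

    # Step 2: scan each sublist independently, stopping at the next splitter.
    # On a GPU, each iteration of this outer loop would be one thread's work.
    out = [0] * n
    owner = [-1] * n          # splitter whose sublist contains each node
    total = {}                # sum of each sublist
    next_splitter = {}        # splitter that starts the following sublist
    for s in splitters:
        running = 0
        node = s
        while True:
            running += values[node]
            out[node] = running
            owner[node] = s
            node = succ[node]
            if node == -1 or is_splitter[node]:
                break
        total[s] = running
        next_splitter[s] = node

    # Step 3: walk the much shorter list of splitters to get each sublist's offset.
    offset = {}
    acc = 0
    s = head
    while s != -1:
        offset[s] = acc
        acc += total[s]
        s = next_splitter[s]

    # Step 4: add the offset of the owning sublist to every local prefix sum.
    for i in range(n):
        out[i] += offset[owner[i]]
    return out


# List order is 3 -> 0 -> 4 -> 1 -> 2, scattered across the arrays.
succ = [4, 2, -1, 0, 1]
values = [10, 20, 30, 40, 50]
print(list_prefix_sums(succ, values, head=3, num_splitters=2))
# prints [50, 120, 150, 40, 100]
```

The result does not depend on which splitters are drawn; the random choice only affects how evenly the work is divided among sublists.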
This paper presents an algorithm for fast sorting of large lists using modern GPUs. The method achie...
Despite the fact that the GPU was originally intended as a co-processor specializing in graphics r...
We describe the design of high-performance parallel radix sort and merge sort routines for manycore ...
General purpose programming on the graphics processing units (GPGPU) has received a lot of attention...
Modern Graphics Processing Units (GPUs) provide high computation power at low costs and have been de...
Graphics Processing Units (GPUs) are a fast evolving architecture. Over the last decade their progra...
Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a s...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
The GPU based on the CUDA architecture developed by NVIDIA is a high-performance computing device...
We present four CUDA based parallel implementations of the Space-Saving algorithm for determining fr...
CUDA is a parallel programming environment that enables significant performance improvement by lever...
We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known a...
GPUs are an increasingly popular implementation platform for a variety of general purpose applicatio...
This paper describes a multi-threaded parallel design and implementation of the Smith-Waterman (SM) ...