In this paper we revisit the design of concurrent data structures -- specifically queues -- and examine their performance portability with regard to the move from conventional CPUs to graphics processors. We have looked at both lock-based and lock-free algorithms and have, for comparison, implemented and optimized the same algorithms on both graphics processors and multi-core CPUs. Particular attention has been paid to the differences between the older Tesla and the newer Fermi and Kepler architectures in this context. We provide a comprehensive evaluation and analysis of our implementations on all examined platforms. Our results indicate that the queues are in general performance portable, but that platform-specific optimizations are possible to ...
The high computational throughput of modern graphics processing units (GPUs) makes them the de-facto ...
This study analyzes the efficiency of parallel computational applications with the adoption of recen...
Most multiprocessors are multiprogrammed to achieve acceptable response time and to increase their u...
Synchronization of concurrent threads is the central problem in order to design efficient concurrent...
As core counts increase and as heterogeneity becomes more common in parallel computing, we face the ...
Abstract. In this work, we study the scalability, performance, design and implementation of basic da...
The convergence of highly parallel many-core graphics processors with conventional multi-core proces...
Concurrent data structures provide the means to multi-threaded applications to share data. Typical d...
The concurrent priority queue is one of the shared memory data structures that can be dynamically ma...
© 2017 by John Wiley & Sons, Inc. All rights reserved. Concurrent data structures are the data sh...
In this paper, we revisit the design of synchronization primitives---specifically barriers, mutexes,...
The efficiency of concurrent data structures is crucial to the performance of multi-threaded program...
This paper investigates the synchronization power of coalesced memory accesses, a family of memory a...
Most multiprocessors are multiprogrammed to achieve acceptable response time. Unfortunately, inoppor...