To get maximum performance on the many-core graphics processors it is important to have an even balance of the workload so that all processing units contribute equally to the task at hand. This can be hard to achieve when the cost of a task is not known beforehand and when new sub-tasks are created dynamically during execution. With the recent advent of scatter operations and atomic hardware primitives it is now possible to bring some of the more elaborate dynamic load balancing schemes from the conventional SMP systems domain to the graphics processor domain. We have compared four different dynamic load balancing methods to see which one is most suited to the highly parallel world of graphics processors. Three of these methods were lock-free and one wa...
In this paper we revisit the design of concurrent data structures -- specifically queues -- and exam...
In this paper, we present a cohesive, practical load balancing framework that addresses many short...
A parallel concurrent application runs most efficiently and quickly when the workload is distributed...
To get maximum performance on the many-core graphics processors it is important to have an even bala...
Abstract — To get maximum performance on the many-core graphics processors, it is important to have ...
In this chapter, we present a methodology for efficient load balancing of computational problems tha...
In this paper we present GPU-Quicksort, an efficient Quicksort algorithm suitable for highly parallel...
The convergence of highly parallel many-core graphics processors with conventional multi-core proces...
The computational power provided by many-core graphics processing units (GPUs) has been exploited i...
Multicomputer systems based on message passing draw attractions in the field of high performance co...
In parallel computing, obtaining maximal performance is often mandatory to solve large and complex p...
We propose a GPU fine-grained load-balancing abstraction that decouples load balancing from work pro...
Programs developed under the Compute Unified Device Architecture obtain the highest performance rate...
A large class of computational problems are characterised by frequent synchronisation, and computati...
The overall efficiency of parallel algorithms is most decisively affected by the strategy applied fo...