Computational task DAGs are executed on parallel computers by a task scheduling algorithm. Intelligent scheduling is critical for achieving high parallelism, low overheads and reduced communication. A key technique for load balancing task DAGs is work stealing (WS), which Blumofe et al. popularized for fork-join computations [2]. In scenarios of high parallel slackness, WS's distributed nature allows it to scale to a large number of cores with low overhead [4]. However, the space of a WS computation grows proportionally to the number of cores. Targeting a lower bound, Blelloch et al. proposed the parallel-depth-first (PDF) scheduler [1]. PDF schedules tasks by following the depth-first (serial) order of computation and has space requirem...
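As a rough illustration of the work-stealing idea mentioned above, the following is a minimal single-threaded simulation, not the scheduler from any of the cited papers. It assumes each worker owns a double-ended queue, takes its own work from the newest end, and steals from the oldest end of a victim's queue when idle. The names `Worker`, `run`, and `schedule` are hypothetical, and victim selection is deterministic here for reproducibility, whereas real work-stealing runtimes typically choose victims at random.

```python
from collections import deque


class Worker:
    """Hypothetical worker holding a private double-ended task queue."""

    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()

    def push(self, task):
        # Newly created tasks go to the bottom (right end) of the owner's deque.
        self.tasks.append(task)

    def pop_local(self):
        # The owner takes its most recently pushed task (LIFO order).
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # A thief takes the victim's oldest task (FIFO order, left end).
        return victim.tasks.popleft() if victim.tasks else None


def run(workers):
    """Simulate rounds in which every worker executes at most one task.

    Returns a list of (worker id, task) pairs in execution order.
    """
    log = []
    while any(w.tasks for w in workers):
        for w in workers:
            task = w.pop_local()
            if task is None:
                # Assumption: steal from the first non-empty victim.
                # Real schedulers usually pick the victim randomly.
                for victim in workers:
                    if victim is not w and victim.tasks:
                        task = w.steal_from(victim)
                        break
            if task is not None:
                log.append((w.wid, task))
    return log


# All work starts on worker 0; worker 1 is idle and must steal.
workers = [Worker(0), Worker(1)]
for t in ["a", "b", "c", "d"]:
    workers[0].push(t)

schedule = run(workers)
print(schedule)
```

In this sketch worker 0 drains its queue from the newest task while worker 1 steals the oldest ones, which is the property that tends to keep an owner's recently touched data local while handing larger, older subcomputations to thieves.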
Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling. They d...
Emerging architecture designs include tens of processing cores on a single chip die; it is believed ...
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlle...
In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good per...
The fork-join paradigm of concurrent expression has gained popularity in conjunction with work-steal...
In this paper we propose new insights into the problem of concurrently scheduling threads through ma...
Single threaded tasks are the basic unit of scheduling in modern runtimes targeting multicore hardwa...
Effective cache utilization is critical to performance in chip-multiprocessor systems (CMP). Modern ...
In systems with complex many-core cache hierarchy, exploiting data locality can significantly reduce...
Since the end of Dennard’s scaling, computer architects have fully embraced parallelism to ...
Most parallel programs exhibit more parallelism than is available in processors produced today. Whi...
Load balancing techniques (e.g. work stealing) are important to obtain the best performance...
Multi-socket multi-core architectures with shared caches in each socket have become mainstream when ...
Task-centric programming models offer a versatile method for exposing parallelism. Such programs are...