Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs

Benoit, Anne
Le Fèvre, Valentin
Raghavan, Padma
Robert, Yves
Sun, Hongyang

Publication date

May 2020

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Abstract

This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit the classical problem while assuming that jobs are subject to transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework where jobs are known offline and do not fail: in the classical framework, list scheduling that gives priority to the longest jobs is known to be a 3-approximation when the schedule is restricted to shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list sche...
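
As an illustration of the classical, failure-free setting the abstract refers to, the following is a minimal sketch of list scheduling with longest-job-first priority for rigid parallel jobs. It is not the paper's resilient heuristic: failures and re-executions are not modeled, and the function name `list_schedule`, the `(name, procs, duration)` job tuples, and the example values are assumptions made for this sketch.

```python
import heapq


def list_schedule(jobs, total_procs):
    """Greedy list scheduling, longest job first (failure-free sketch).

    jobs: list of (name, procs, duration) tuples (hypothetical format).
    Returns (start_times, makespan).
    """
    # Priority list: longest duration first, ties broken by name.
    pending = sorted(jobs, key=lambda j: (-j[2], j[0]))
    running = []          # min-heap of (finish_time, procs)
    free = total_procs    # processors currently idle
    now = 0.0
    start_times = {}

    while pending or running:
        # Scan the list in priority order and start every job that fits.
        still_pending = []
        for name, procs, duration in pending:
            if procs <= free:
                free -= procs
                start_times[name] = now
                heapq.heappush(running, (now + duration, procs))
            else:
                still_pending.append((name, procs, duration))
        pending = still_pending

        if not running:
            # Nothing is running, yet some job does not fit on an idle platform.
            raise ValueError("a job requests more processors than the platform has")

        # Advance time to the next job completion and release its processors.
        now, procs = heapq.heappop(running)
        free += procs

    return start_times, now


if __name__ == "__main__":
    example = [("A", 2, 4), ("B", 2, 3), ("C", 4, 2), ("D", 1, 1)]
    starts, makespan = list_schedule(example, total_procs=4)
    print(starts, makespan)
```

In this sketch a job may start as soon as enough processors are idle, with no shelf boundaries; a shelf-based variant would instead wait for all jobs of the current shelf to finish before starting the next group.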
