This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit the classical problem while assuming that jobs are subject to failures caused by transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework where jobs are known offline and do not fail: in this framework, list scheduling that gives priority to the longest jobs is known to be a 3-approximation when the schedule is restricted to shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unres...
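To make the setting of the abstract above concrete, here is a minimal Python sketch of longest-first list scheduling with re-execution of failed jobs. It is an illustration under stated assumptions, not the paper's algorithm: jobs are assumed rigid (a fixed processor count and execution time each), a failure is assumed to be detected only when an attempt finishes, and the names `list_schedule` and `fails` are hypothetical. The failure oracle `fails(job_id, attempt)` stands in for whatever error model applies.

```python
import heapq


def list_schedule(jobs, total_procs, fails=lambda job_id, attempt: False):
    """Return the makespan of a longest-first list schedule.

    jobs: dict mapping job id -> (processors needed, execution time).
    fails: hypothetical oracle; fails(job_id, attempt) is True when the
           given attempt (1-based) of the job ends in an error.
    """
    for job_id, (procs, _) in jobs.items():
        if procs > total_procs:
            raise ValueError(f"job {job_id!r} needs more than {total_procs} processors")

    def priority(job_id):
        # Longest job first; ties broken by job id for determinism.
        return (-jobs[job_id][1], job_id)

    pending = sorted(jobs, key=priority)
    attempts = {job_id: 0 for job_id in jobs}
    running = []  # min-heap of (finish time, job id)
    free = total_procs
    now = 0.0
    makespan = 0.0

    while pending or running:
        # Greedily start every pending job that fits, in priority order.
        waiting = []
        for job_id in pending:
            procs, duration = jobs[job_id]
            if procs <= free:
                free -= procs
                heapq.heappush(running, (now + duration, job_id))
            else:
                waiting.append(job_id)
        pending = waiting

        # Advance to the next completion time and release all jobs ending then.
        now = running[0][0]
        while running and running[0][0] == now:
            _, job_id = heapq.heappop(running)
            free += jobs[job_id][0]
            attempts[job_id] += 1
            if fails(job_id, attempts[job_id]):
                pending.append(job_id)  # failed attempt: re-execute from scratch
            else:
                makespan = max(makespan, now)
        pending.sort(key=priority)

    return makespan


if __name__ == "__main__":
    jobs = {"a": (2, 4.0), "b": (2, 3.0), "c": (4, 2.0)}
    print(list_schedule(jobs, total_procs=4))
    print(list_schedule(jobs, total_procs=4, fails=lambda j, k: j == "b" and k == 1))
```

The sketch places jobs without any shelf restriction: a job starts as soon as enough processors are free. A failed job simply re-enters the priority list and competes again for processors.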
When jobs have to be processed on a set of identical parallel machines so as to minimize...
The application of computers in safety-critical systems is expanding rapidly. With reliability speci...
Heterogeneous distributed systems are widely deployed for executing computatio...
We study the resilient scheduling of moldable parallel jobs on high-performanc...
This paper focuses on the resilient scheduling of moldable parallel jobs on hi...
Applications implemented on critical systems are subject to both safety critic...
The optimization of parallel applications is difficult to achieve by classical...
Most list scheduling heuristics rely on a simple platform model where communication c...
Scheduling in High-Performance Computing (HPC) has been traditionally centered...
Scheduling deteriorating jobs on parallel machines is an NP-hard problem, for which heuristics would...
In this paper, we study a scheduling problem with unreliable jobs. Each job is characterized by a su...