This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit the problem by assuming that jobs are subject to transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework where jobs are known offline and do not fail: in that framework, list scheduling that gives priority to the longest jobs is known to be a 3-approximation when the schedule is restricted to shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list scheduling remains a 2-approximation. Th...
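As a rough illustration of the setting described above, the following Python sketch simulates unrestricted list scheduling with longest-job priority when some executions fail and must be repeated. It is a minimal sketch under stated assumptions, not the paper's algorithm or analysis: jobs are assumed rigid (a fixed processor count and execution time each), an error is assumed to be detected only at the end of an execution, and the number of failed attempts per job is supplied up front to keep the run deterministic. The names `list_schedule`, `jobs`, `num_procs`, and `failures` are hypothetical.

```python
import heapq


def list_schedule(jobs, num_procs, failures=None):
    """Simulate greedy list scheduling of rigid parallel jobs.

    jobs: dict mapping job name -> (processors needed, execution time).
    failures: dict mapping job name -> number of failed attempts before the
              job finally succeeds (an assumption made for this sketch).
    Returns (makespan, log), where log lists (start_time, job) per attempt.
    """
    for name, (procs, _) in jobs.items():
        if procs > num_procs:
            raise ValueError(f"job {name!r} needs more than {num_procs} processors")

    def priority(name):
        # Longest execution time first; ties broken by name.
        return (-jobs[name][1], name)

    remaining_failures = {name: (failures or {}).get(name, 0) for name in jobs}
    ready = sorted(jobs, key=priority)
    running = []  # min-heap of (finish_time, job)
    free = num_procs
    now = 0
    log = []

    while ready or running:
        # Start every ready job that fits, scanning in priority order.
        waiting = []
        for name in ready:
            procs, duration = jobs[name]
            if procs <= free:
                free -= procs
                heapq.heappush(running, (now + duration, name))
                log.append((now, name))
            else:
                waiting.append(name)
        ready = waiting

        # Advance to the next completion time and collect all jobs ending then.
        now, first = heapq.heappop(running)
        finished = [first]
        while running and running[0][0] == now:
            finished.append(heapq.heappop(running)[1])

        for name in finished:
            free += jobs[name][0]
            if remaining_failures[name] > 0:
                # The attempt failed: the job re-enters the list for re-execution.
                remaining_failures[name] -= 1
                ready.append(name)
        ready.sort(key=priority)

    return now, log


example_jobs = {"A": (2, 4), "B": (2, 3), "C": (1, 2)}
makespan_no_failure, _ = list_schedule(example_jobs, num_procs=3)
makespan_one_failure, _ = list_schedule(example_jobs, num_procs=3, failures={"A": 1})
```

On this small instance with three processors, the simulated makespan is 7 when no job fails and 11 when job `A` fails once, since the re-executed `A` again takes priority over `B`.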
When jobs have to be processed on a set of identical parallel machines so as to minimize...
The application of computers in safety-critical systems is expanding rapidly. With reliability speci...
In this paper, we study a scheduling problem with unreliable jobs. Each job is characterized by a su...
This paper focuses on the resilient scheduling of parallel jobs on high-perfor...
This paper focuses on the resilient scheduling of parallel jobs on high-perfor...
We study the resilient scheduling of moldable parallel jobs on high-performanc...
This paper focuses on the resilient scheduling of moldable parallel jobs on hi...
Applications implemented on critical systems are subject to both safety critic...
Scheduling in High-Performance Computing (HPC) has been traditionally centered...
Most list scheduling heuristics rely on a simple platform model where communication c...
The optimization of parallel applications is difficult to achieve by classical...
Scheduling deteriorating jobs on parallel machines is an NP-hard problem, for which heuristics would...
Imprecise computation and parallel processing are two techniques for avoiding timing faults and tole...
Heterogeneous distributed systems are widely deployed for executing computatio...