This work provides an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially distributed failure inter-arrival times, which, to the best of our knowledge, is the first rigorous proof that periodic checkpointing is optimal. For non-exponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of work completed before the next failure, which provides a good heuristic for minimizing the expected execution time. Our work considers various models of job parallelism and of parallel checkpointing o...
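To make the periodic-checkpointing claim concrete, the following is a minimal sketch, not the authors' code. It assumes a commonly used closed form for this setting, which is not quoted from the abstract above: with failure rate `fail_rate` (exponential inter-arrival times), checkpoint cost `ckpt_cost`, downtime `downtime`, and recovery cost `recovery`, the expected time to complete one chunk of length `w` followed by a checkpoint is `exp(fail_rate * recovery) * (1 / fail_rate + downtime) * (exp(fail_rate * (w + ckpt_cost)) - 1)`. All function and parameter names are hypothetical.

```python
import math


def expected_makespan(total_work, num_chunks, fail_rate, ckpt_cost,
                      downtime=0.0, recovery=0.0):
    """Expected time to finish total_work split into num_chunks equal chunks.

    Assumed model (not taken from the abstract): exponential failures with
    rate fail_rate, a checkpoint of cost ckpt_cost after every chunk, and
    after each failure a downtime followed by a recovery that can itself
    be hit by a failure.
    """
    segment = total_work / num_chunks + ckpt_cost  # work plus its checkpoint
    per_chunk = (math.exp(fail_rate * recovery)
                 * (1.0 / fail_rate + downtime)
                 * math.expm1(fail_rate * segment))
    return num_chunks * per_chunk


def best_chunk_count(total_work, fail_rate, ckpt_cost,
                     downtime=0.0, recovery=0.0, max_chunks=10000):
    """Brute-force search for the number of equal chunks with the lowest
    expected makespan under the assumed model."""
    return min(
        range(1, max_chunks + 1),
        key=lambda k: expected_makespan(total_work, k, fail_rate, ckpt_cost,
                                        downtime, recovery),
    )
```

Calling `best_chunk_count(1000.0, 0.01, 1.0)` returns the chunk count that balances checkpoint overhead against the work lost to failures. The brute-force search is chosen for clarity over speed, and the sketch covers only equal-length chunks under exponential failures, not the dynamic programming heuristic mentioned for other distributions.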
The parallel computing platforms available today are increasingly larger and t...
This report provides an introduction to the design of scheduling algorithms to cope with faults on l...
This paper studies checkpointing strategies for parallel jobs subject to fail-stop errors. The optim...
Large scale applications running on new computing platforms with thousands o...
The parallel computing platforms available today are increasingly larger. Typically the emerging par...
Over the last decade, computing systems have turned to large scale parallel platforms composed of thousand...
This paper deals with the complexity of scheduling computational workflows in ...