This work provides an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially distributed failure inter-arrival times, which, to the best of our knowledge, is the first rigorous proof that periodic checkpointing is optimal. For non-exponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of work completed before the next failure, which provides a good heuristic for minimizing the expected execution time. Our work considers various models of job parallelism and of parallel checkpointing overhead. We first perform extensiv...
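For intuition on periodic checkpointing under exponentially distributed failures, the classical Young/Daly first-order approximation gives the period that minimizes expected execution time. The sketch below is illustrative only and not taken from the paper; `mtbf` (mean time between failures) and `checkpoint_cost` are assumed inputs.

```python
import math

def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order approximation of the optimal checkpointing period
    W = sqrt(2 * mu * C), where mu is the platform MTBF and C is the
    time to take one checkpoint (both in the same time unit)."""
    return math.sqrt(2 * mtbf * checkpoint_cost)

# Example: MTBF of one day (86400 s) and a 60 s checkpoint cost
# suggest checkpointing roughly every 54 minutes.
period = young_daly_period(86400, 60)
```

The approximation is accurate when the checkpoint cost is small relative to the MTBF; the papers listed here derive exact optima and handle non-exponential failure distributions.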
The parallel computing platforms available today are increasingly larger. Typically the emerging par...
The scaling-up of new parallel and distributed computing platforms raises many...
One has a large workload that is ``divisible''---its constituent work's granularity can be adjusted ...
This work deals with scheduling and checkpointing strategies to execute scientific workflows on fail...
This paper studies checkpointing strategies for parallel jobs subject to fail-stop errors. The optim...
In this paper, we design and analyze strategies to replicate the execution of an application on two ...
This report combines checkpointing and replication for the reliable execution of linear workflows. Whil...
In high-performance computing environments, input/output (I/O) from various sources often contend for...
High performance computing applications must be resilient to faults, which are common occurrences es...
This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) p...