This work provides an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially distributed failure inter-arrival times, which, to the best of our knowledge, is the first rigorous proof that periodic checkpointing is optimal. For non-exponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of work completed before the next failure, which provides a good heuristic for minimizing the expected execution time. Our work considers various models of job parallelism and of parallel checkpointing overhead. We first perform extensiv...
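For intuition on periodic checkpointing under exponentially distributed failures, the classical Young/Daly first-order approximation gives the period that minimizes expected execution time. The sketch below is illustrative only and not taken from the paper; `mtbf` (mean time between failures) and `checkpoint_cost` are assumed inputs.

```python
import math

def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order approximation of the optimal checkpointing period
    W = sqrt(2 * mu * C), where mu is the platform MTBF and C is the
    time to take one checkpoint (both in the same time unit)."""
    return math.sqrt(2 * mtbf * checkpoint_cost)

# Example: MTBF of one day (86400 s) and a 60 s checkpoint cost
# suggest checkpointing roughly every 54 minutes.
period = young_daly_period(86400, 60)
```

The approximation is accurate when the checkpoint cost is small relative to the MTBF; the papers listed here derive exact optima and handle non-exponential failure distributions.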
The parallel computing platforms available today are increasingly larger. Typically the emerging par...
The scaling-up of new parallel and distributed computing platforms raises many...
One has a large workload that is ``divisible''---its constituent work's granularity can be adjusted ...
This work deals with scheduling and checkpointing strategies to execute scientific workflows on fail...
This paper studies checkpointing strategies for parallel jobs subject to fail-stop errors. The optim...
In this paper, we design and analyze strategies to replicate the execution of an application on two ...
This report combines checkpointing and replication for the reliable execution of linear workflows. Whil...
In high-performance computing environments, input/output (I/O) from various sources often contend for...
High performance computing applications must be resilient to faults, which are common occurrences es...
This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) p...