This paper studies checkpointing strategies for parallel jobs subject to fail-stop errors. The optimal strategy is well known when failure inter-arrival times obey an Exponential law, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy is asymptotically optimal for very long jobs. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with a high infant mortality (such as LogNormal 2.51 or Weibull 0.5), the execution time is divide...
In this paper, we design and analyze strategies to replicate the execution of an application on two ...
This paper investigates the optimal number of processors to execute a parallel job, whose speedup pr...
This work provides an analysis of checkpointing strategies for minimizing expe...
This paper revisits checkpointing strategies when workflows composed of multiple tasks execute on a ...
The parallel computing platforms available today are increasingly large. Typically the emerging par...
The scaling up of new parallel and distributed computing platforms raises many...
This work provides an optimal checkpointing strategy to protect iterative applications from fail-sto...