This work provides an optimal checkpointing strategy to protect iterative applications from fail-stop errors. We consider a very general framework, where the application repeats the same execution pattern by executing consecutive iterations, and where each iteration is composed of several tasks. These tasks have different execution lengths and different checkpoint costs. Assume that there are $n$ tasks and that task $a_i$, where $0 \leq i < n$, has execution time $t_i$ and checkpoint cost $C_i$. A naive strategy would checkpoint after each task. A strategy inspired by the Young/Daly formula would select the task $a_{\min}$ with smallest checkpoint cost $C_{\min}$ and would checkpoint after every $p$-th instance of that task, leading to a check...
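The Young/Daly-inspired baseline described above can be sketched in a few lines. This is a minimal illustration, not the paper's optimal strategy. It assumes the classical first-order Young/Daly period $\sqrt{2 \mu C}$, where $\mu$ is the platform MTBF, and it assumes that $p$ is chosen so that $p$ iterations approximate that period. The function names, the `mtbf` parameter, and the rounding rule are hypothetical choices made for this example.

```python
import math


def young_daly_period(mtbf, checkpoint_cost):
    """Classical first-order Young/Daly period: sqrt(2 * MTBF * C)."""
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def young_daly_inspired_plan(exec_times, ckpt_costs, mtbf):
    """Return (index of a_min, p) for the Young/Daly-inspired baseline.

    exec_times[i] is t_i and ckpt_costs[i] is C_i for task a_i.
    Assumption (not from the source): p is the number of iterations whose
    total length is closest to the Young/Daly period computed with C_min.
    """
    # Task with the smallest checkpoint cost (a_min in the text).
    i_min = min(range(len(ckpt_costs)), key=lambda i: ckpt_costs[i])
    # Time for one full iteration: the sum of all task execution times.
    iteration_time = sum(exec_times)
    target = young_daly_period(mtbf, ckpt_costs[i_min])
    # Checkpoint at least once every iteration's worth of a_min instances.
    p = max(1, round(target / iteration_time))
    return i_min, p


if __name__ == "__main__":
    # Three tasks per iteration, MTBF of one hour (all values in seconds).
    print(young_daly_inspired_plan([10, 20, 30], [5, 2, 8], 3600))  # (1, 2)
```

With these sample values, the cheapest checkpoint belongs to task $a_1$ ($C_1 = 2$), the Young/Daly period is $\sqrt{2 \cdot 3600 \cdot 2} = 120$ seconds, and one iteration takes 60 seconds, so the sketch checkpoints after every second instance of $a_1$.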
Due to the increasing number of nodes in supercomputers, scientific applications are frequently inte...
The Young/Daly formula for periodic checkpointing is known to hold for a divisible load application w...
This paper revisits checkpointing strategies when workflows composed of multiple tasks execute on a ...