We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to th...
Large-scale platforms currently experience errors from two different sources, namely fail-stop errors...
This paper provides a model and an analytical study of replication as a techni...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
This chapter describes a unified framework for the detection and correction of...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
In this paper, we combine the traditional checkpointing and rollback recovery ...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
Errors have become a critical problem for high performance computing. Checkpoi...