Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience techniques must accommodate both error sources. To cope with the double challenge, a two-level checkpointing and rollback recovery approach can be used, with additional verifications for silent error detection. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on a stable storage (e.g., an external disk). On the contrary, it is possible to use in-memory checkpoints for silent errors, which provide a much smaller checkpointing and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms that are less costly than the guaranteed ones but do not detect al...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
Several recent papers have introduced a periodic verification mechanism to det...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
This chapter describes a unified framework for the detection and correction of...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
Errors have become a critical problem for high performance computing. Checkpoi...