Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience techniques must accommodate both error sources. To cope with this double challenge, a two-level checkpointing and rollback recovery approach can be used, with additional verifications for silent error detection. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint to stable storage (e.g., an external disk). In contrast, silent errors can be handled with in-memory checkpoints, which incur a much smaller checkpointing and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms that are less costly than guaranteed ones but do not detect all silent errors. In th...
This paper provides a model and an analytical study of replication as a technique to detect and corre...
Large-scale platforms currently experience errors from two different sources, namely fail-stop errors...
In this paper, we combine the traditional checkpointing and rollback recovery ...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
This report describes a unified framework for the detection and correction of silent errors, which co...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
Several recent papers have introduced a periodic verification mechanism to detect silent errors in it...
Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. ...
Many methods are available to detect silent errors in high-performance computing (HPC) applications. ...
This report combines checkpointing and replication for the reliable execution of linear workflows. Whil...