Silent errors, or silent data corruptions, constitute a major threat on very large-scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with some verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we investigate the use of partial verification mechanisms in addition to a guaranteed verification. The main objective is to investigate to what extent it is worthwhile to use some low-cost but less precise verification type i...
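The coupling of checkpointing with verification described in the abstract can be illustrated with a minimal sketch. It assumes a work pattern split into equal segments, a partial verification after every segment except the last, and a guaranteed verification just before the checkpoint. The function name `run_pattern` and the parameters `segments`, `error_prob` and `recall` (the probability that a partial verification catches an existing corruption) are hypothetical names chosen for illustration; they are not taken from the paper.

```python
import random


def run_pattern(segments, error_prob, recall, rng):
    """Simulate one pattern: work segments, partial checks, one guaranteed check.

    Returns (committed, detected_at):
      committed   -- True if the checkpoint was written (data verified clean)
      detected_at -- index of the segment after which corruption was caught,
                     or None if no corruption was detected
    All parameters are illustrative assumptions, not the paper's model.
    """
    corrupted = False
    for i in range(segments):
        # Execute one work segment; a silent error may strike during it.
        if rng.random() < error_prob:
            corrupted = True

        if i == segments - 1:
            # Guaranteed verification: always detects an existing corruption,
            # so corrupted data is never written into the checkpoint.
            if corrupted:
                return False, i
        else:
            # Partial verification: cheaper, but only catches the corruption
            # with probability `recall`.
            if corrupted and rng.random() < recall:
                return False, i

    # Reached only when the guaranteed verification passed.
    return True, None


rng = random.Random(0)
no_error = run_pattern(segments=4, error_prob=0.0, recall=0.5, rng=rng)
caught_early = run_pattern(segments=4, error_prob=1.0, recall=1.0, rng=rng)
caught_late = run_pattern(segments=4, error_prob=1.0, recall=0.0, rng=rng)
```

In this sketch, a partial verification with high recall lets the pattern roll back early (`caught_early`), whereas with zero recall the corruption is only caught by the guaranteed verification at the end (`caught_late`). In both cases no corrupted checkpoint is committed.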
This chapter describes a unified framework for the detection and correction of...
Many methods are available to detect silent errors in high-performance computi...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
Errors have become a critical problem for high performance computing. Checkpoi...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
This paper provides a model and an analytical study of replication as a technique to detect and corre...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
Several recent papers have introduced a periodic verification mechanism to detect silent errors in it...