In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, so a mechanism to detect them must be provided. We consider two models: (i) errors are detected after a delay that follows a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal checkpointing period that minimizes the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to...
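To make the notion of "waste" and "optimal period" concrete, here is a minimal sketch that uses the classical first-order approximation for fail-stop failures (the Young/Daly formula), not the silent-error formulas derived in the paper, which additionally account for detection latency or verification cost. The function names and the example values (a 60 s checkpoint cost, a one-day platform MTBF) are assumptions chosen for illustration only.

```python
import math


def first_order_waste(period, checkpoint_cost, mtbf):
    """First-order waste for fail-stop failures (illustrative assumption).

    checkpoint_cost / period : fraction of time spent writing checkpoints.
    period / (2 * mtbf)      : expected fraction of work lost to re-execution,
                               assuming a failure strikes mid-period on average.
    Valid only when period is small compared to mtbf.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


def young_daly_period(checkpoint_cost, mtbf):
    """Period that minimizes first_order_waste: sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    C = 60.0        # hypothetical checkpoint cost, in seconds
    MU = 86400.0    # hypothetical platform MTBF, in seconds (one day)
    t_opt = young_daly_period(C, MU)
    print(f"period = {t_opt:.1f} s, waste = {first_order_waste(t_opt, C, MU):.4f}")
```

Under this simplified model the two waste terms are equal at the optimum, so the minimal waste is `2 * C / T`. With silent errors, the trade-off changes because an error may be detected only after one or more corrupted checkpoints have already been taken.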
Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. ...
Many methods are available to detect silent errors in high-performance computi...
Several recent papers have introduced a periodic verification mechanism to det...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
Errors have become a critical problem for high performance computing. Checkpoi...
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
This chapter describes a unified framework for the detection and correction of...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
In this paper, we combine the traditional checkpointing and rollback recovery ...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...