International audience. In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. In contrast to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after a delay that follows a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to...
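To make "optimal period" and "waste" concrete, the sketch below works through the classical first-order fail-stop baseline (Young's approximation), which is the traditional strategy the abstract says it revisits. It is not the silent-error result of the paper, whose formulas are not given in this abstract. The function names, the parameter names (`checkpoint_cost`, `mtbf`), and the numeric values are illustrative assumptions. The approximation ignores recovery and downtime costs and is only meaningful when the checkpoint cost is small compared with the mean time between failures.

```python
import math


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal checkpointing period for fail-stop failures.

    Assumption: Young's approximation T = sqrt(2 * C * mu), where C is the
    checkpoint cost and mu is the platform mean time between failures.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time not spent on useful computation.

    First term: checkpoint overhead per period.
    Second term: expected re-execution after a failure, which on average
    strikes in the middle of a period (hence period / 2 lost per failure).
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Hypothetical values: 60 s checkpoints, one failure per day on average.
C = 60.0
MU = 86_400.0

T_OPT = young_period(C, MU)
WASTE_OPT = first_order_waste(T_OPT, C, MU)

if __name__ == "__main__":
    print(f"period = {T_OPT:.1f} s, waste = {WASTE_OPT:.4f}")
```

At the optimal period the two waste terms are equal, so the minimum waste reduces to `sqrt(2 * C / mu)`. With silent errors the picture changes because an error may be detected long after it strikes, which is what the two detection models in the abstract address.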
This paper provides a model and an analytical study of replication as a techni...
Many methods are available to detect silent errors in high-performance computi...
This paper investigates the optimal number of processors to execute a parallel...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
Errors have become a critical problem for high performance computing. Checkpoi...
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
In this paper, we combine the traditional checkpointing and rollback recovery ...
This chapter describes a unified framework for the detection and correction of...