This work focuses on resilience techniques at extreme scale. Many papers address fail-stop errors, and many others address silent errors (also called silent data corruptions), but very few address both simultaneously, even though HPC applications will have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are handled via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model, and a full characterization o...
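To make the idea of a computational pattern concrete, here is a minimal sketch in Python. The layout is an assumption for illustration only, not the paper's actual pattern or its optimal parameters: work is split evenly into segments and chunks, each chunk ends with a verification, each segment ends with an in-memory checkpoint, and the pattern ends with a disk checkpoint. The names `Costs`, `build_pattern`, and `error_free_overhead` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical per-operation costs, in seconds."""
    verification: float     # one verification of the application data
    mem_checkpoint: float   # one in-memory checkpoint
    disk_checkpoint: float  # one disk checkpoint


def build_pattern(work, segments, chunks_per_segment, costs):
    """Return the pattern as an ordered list of (step, duration) pairs.

    Assumed layout (illustrative only): the work is split evenly into
    `segments` segments, each split into `chunks_per_segment` chunks.
    Every chunk is followed by a verification, every segment by an
    in-memory checkpoint, and the whole pattern by a disk checkpoint.
    """
    chunk = work / (segments * chunks_per_segment)
    steps = []
    for _ in range(segments):
        for _ in range(chunks_per_segment):
            steps.append(("compute", chunk))
            steps.append(("verify", costs.verification))
        steps.append(("mem_checkpoint", costs.mem_checkpoint))
    steps.append(("disk_checkpoint", costs.disk_checkpoint))
    return steps


def error_free_overhead(steps, work):
    """Fraction of extra time spent on resilience when no error strikes."""
    total = sum(duration for _, duration in steps)
    return (total - work) / work
```

For example, `build_pattern(100.0, 2, 2, Costs(1.0, 2.0, 10.0))` yields four verified chunks, two in-memory checkpoints, and one disk checkpoint. The sketch only accounts for error-free execution; modeling re-execution after an error would require error rates and recovery costs that this excerpt does not give.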
Many methods are available to detect silent errors in high-performance computi...
Many methods are available to detect silent errors in high-performance computi...
This thesis focuses on resilience for high-performance applications at very large scal...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
This chapter describes a unified framework for the detection and correction of...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
In this paper, we combine the traditional checkpointing and rollback recovery ...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
Errors have become a critical problem for high performance computing. Checkpoi...
This thesis focuses on resilience for high performance applications that execute on large scale plat...