This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication needed to efficiently detect and correct silent errors. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of proce...
Various technological developments in the microprocessor world make modern computing systems more vu...
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent...
As high-end computing machines continue to grow in size, issues such as fault tolerance and...
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Perfo...
This chapter describes a unified framework for the detection and correction of...
Large-scale platforms currently experience errors from two different sources, ...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
This paper revisits replication coupled with checkpointing for fail-stop error...
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and sile...
In this work we propose partial task replication and checkpointing for task-parallel HPC applicatio...
Resilience has become a critical problem for high performance computing. Checkpointing protocols ar...