This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors, as well as to cope with both silent and fail-stop errors on large-scale platforms. Fail-stop errors are immediately detected, unlike silent errors for which a detection mechanism is required. To detect silent errors, many application-specific techniques are available, either based on algorithms (ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication for two frameworks: (i) when the platform is subject only to silent errors, and (ii) when the platform is subject to both silent and fail...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
Large-scale platforms currently experience errors from two different sources, namely fail-stop errors...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
Errors have become a critical problem for high performance computing. Checkpoi...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. ...