This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors, as well as to cope with both silent and fail-stop errors on large-scale platforms. Fail-stop errors are immediately detected, unlike silent errors for which a detection mechanism is required. To detect silent errors, many application-specific techniques are available, either based on algorithms (ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication for two frameworks: (i) when the platform is subject only to silent errors, and (ii) when the platform is subject to both silent and fail...
The move towards exascale super-computers requires new fault tolerance solutio...
In a database cluster, preventive replication can provide strong consistency without the limitations...
In this paper, we design and analyze strategies to replicate the execution of an application on two ...
This paper provides a model and an analytical study of replication as a technique to detect and corre...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
This paper provides a model and an analytical study of replication as a technique to detect and corre...
High performance computing applications must be resilient to faults, which are common occurrences es...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. ...
A model checker can produce a counter-example trace for an erroneous program, which is often difficu...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
Many methods are available to detect silent errors in high-performance computing (HPC) applications. ...
Shared-memory concurrency is a classic concurrency model which, among other things, makes it possibl...
Checkpointing is a classical technique to mitigate the overhead of adjoint Algorithmic Differentiati...
This document from the fmr group introduces four types of methods for simplifying and/or partitionin...