The reliability of concurrent and distributed systems often depends on well-known fault-tolerance techniques. One such technique is based on checkpointing and rollback recovery. Checkpointing requires processes to take snapshots of their current states regularly, so that a rollback recovery strategy can bring the system back to a previous consistent state whenever a failure occurs. In this paper, we consider a message-passing concurrent programming language and propose a novel rollback recovery strategy that is based on explicit checkpointing operators and the use of a (partially) reversible semantics for rolling back the system.
Comment: To appear in the Proceedings of the 19th International Conference on Formal Aspec...
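As a rough illustration of the checkpoint-and-rollback idea described in the abstract, here is a minimal single-process sketch in Python. It is not the paper's strategy: the names `Process`, `checkpoint`, `rollback`, and `run_with_recovery` are hypothetical, and the sketch ignores message passing entirely, so it does not address the cross-process consistency that the reversible semantics is meant to handle.

```python
import copy


class Process:
    """Hypothetical process with explicit checkpoint/rollback operators."""

    def __init__(self, state):
        self.state = state
        self._checkpoints = []  # stack of saved snapshots

    def checkpoint(self):
        # Save a deep copy so later mutations do not alter the snapshot.
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restore the most recent snapshot; it stays on the stack for reuse.
        if not self._checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self._checkpoints[-1])


def run_with_recovery(proc, step):
    """Take a checkpoint, run one step, and roll back if the step fails."""
    proc.checkpoint()
    try:
        step(proc.state)
        return True
    except Exception:
        proc.rollback()
        return False
```

A step that raises after partially updating the state leaves the process exactly as it was at the last checkpoint. In a message-passing setting, rolling back one process may also require undoing the effects of messages it has sent, which this sketch does not model.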
Causal-consistent reversible debugging is an innovative technique for debugging concurrent syst...
This survey covers rollback-recovery techniques that do not require special language constructs. In ...
We have addressed the complex problem of recovery for concurrent failures in distributed computing e...
This paper analyzes the relationship between a distributed checkpoint/rollback...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
The problem of rollback-recovery in message-passing systems has undergone extensi...
Rollback is a fundamental technique for ensuring reliability of systems, allow...
Concurrent reversibility has been studied in different areas, such as biolog...
Checkpointing protocols usually rely on the constitution of consistent global states, from which the...
This thesis studies a forward recovery strategy using checkpointing and optimistic execution in para...
We consider the problem of bringing a distributed system to a consistent state after transient fail...
For implementing fault-tolerance in multicomputer systems, backward error recovery, based on checkpo...
Checkpoint-based rollback recovery is a very popular category of fault tolerance techniques, which ...
Reversible computing allows one to run programs not only in the usual forward ...