This paper describes a checkpoint comparison and optimistic execution technique for error detection and recovery in distributed and parallel systems. The approach is based on lookahead execution and rollback validation. It uses replicated tasks executing on different processors for forward recovery and checkpoint comparison for error detection. Two schemes derived from this strategy are analyzed and compared with triplication and voting, and with two common backward recovery methods. The impact of checkpoint time, checkpoint validation time, and process restart time is also examined. An implementation on a Sun NFS network with six benchmark programs is presented. Compared with classic checkpointing and rollback techniques, our strategy prov...
In this paper, we have addressed the complex problem of recovery for concurrent failures in distribu...
As High Performance platforms (Clusters, Grids, etc.) continue to grow in size...
As High Performance platforms (Clusters, Grids, etc.) continue to grow in size...
This thesis studies a forward recovery strategy using checkpointing and optimistic execution in para...
For implementing fault-tolerance in multicomputer systems, backward error recovery, based on checkpo...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
We propose a generalized forward recovery checkpointing scheme, with lookahead execution a...
This thesis studies a forward recovery strategy using checkpointing and optimistic execution in para...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
We consider the problem of bringing a distributed system to a consistent state after transient fail...
Checkpoint-based rollback recovery is a very popular category of fault tolerance techniques, which ...
Fault-tolerance protocols play an important role in today's long runtime scienti...
Fault-tolerance protocols play an important role in today's long runtime scienti...
As High Performance platforms (Clusters, Grids, etc.) continue to grow in size...