We propose a new algorithm for recovering asynchronously from failures in a distributed computation. Our algorithm is based on two novel concepts: a fault-tolerant vector clock to maintain causality information in spite of failures, and a history mechanism to detect orphan states and obsolete messages. These two mechanisms, together with checkpointing and message logging, are used to restore the system to a consistent state after a failure of one or more processes. Our algorithm is completely asynchronous. It handles multiple failures and network partitioning, does not assume any message ordering, causes the minimum amount of rollback, and restores the maximum recoverable state with low overhead. Earlier optimistic protocols lack one or more o...
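The abstract names a fault-tolerant vector clock but does not describe its structure. As a rough illustration only, the sketch below assumes one plausible design: each clock entry is a hypothetical (incarnation, counter) pair compared lexicographically, and a restarting process bumps its incarnation and resets its counter. The class name `FTVectorClock` and the helper `happened_before` are invented for this example and are not taken from the paper.

```python
from typing import List, Tuple

# Assumed entry layout: (incarnation, counter), compared lexicographically.
Entry = Tuple[int, int]


class FTVectorClock:
    """Hypothetical sketch of a vector clock that tolerates process restarts."""

    def __init__(self, n: int, pid: int) -> None:
        self.pid = pid
        self.entries: List[Entry] = [(0, 0)] * n

    def tick(self) -> None:
        # Advance the local counter within the current incarnation.
        inc, cnt = self.entries[self.pid]
        self.entries[self.pid] = (inc, cnt + 1)

    def send(self) -> List[Entry]:
        # Advance the local entry, then return a copy to attach to the message.
        self.tick()
        return list(self.entries)

    def receive(self, stamp: List[Entry]) -> None:
        # Component-wise maximum of the two clocks, then a local step.
        self.entries = [max(a, b) for a, b in zip(self.entries, stamp)]
        self.tick()

    def restart(self) -> None:
        # After a failure: start a new incarnation and reset the counter.
        inc, _ = self.entries[self.pid]
        self.entries[self.pid] = (inc + 1, 0)


def happened_before(u: List[Entry], v: List[Entry]) -> bool:
    # u precedes v if every entry of u is <= the matching entry of v and u != v.
    return all(a <= b for a, b in zip(u, v)) and u != v
```

Under this assumption, a timestamp issued before a failure still compares as earlier than the restarted process's clock, because the incarnation number dominates the comparison even though the counter was reset.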
Basing rollback recovery on optimistic message logging and replay avoids the need for synchronizatio...
In this work, we have addressed the complex problem of recovery for concurrent failures in distribut...
In this paper, we have addressed the complex problem of recovery for concurrent failures in distribu...
In this paper, we present a new protocol for optimistic rollback recovery in distributed systems. Th...
In this work, we present a high performance recovery algorithm for distributed systems in which chec...
... a process is logged on stable storage [5], and each process is occasionally checkpoint...
In this work we have addressed the complex problem of recovery for concurrent failures in a distribu...
This paper presents a deterministic algorithm that solves consensus in asynchronous distributed syst...
We introduce a new algorithm for consistent failure detection in asynchronous systems. Informally, c...
Message logging and checkpointing can provide fault tolerance in distributed systems in which all pr...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms ...
In a distributed system using rollback recovery, information saved on stable storage d...