This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/19117

This dissertation presents a new protocol that allows rollback-recovery and process replication to co-exist in a distributed system. The protocol relies on a novel data structure called the antecedence graph, which tracks the nondeterministic events during failure-free operation and provides information for recreating them if a failure occurs. The rollback-recovery part of the protocol combines the low failure-free overhead of optimistic rollback-recovery with the advantages of pessimistic rollback-recovery, namely fast output commit, limited rollback, and failure containment. The process replication part of the protocol features a new mult...
The provision of fault tolerance is an important aspect to the success of distributed and cluster co...
As High Performance platforms (Clusters, Grids, etc.) continue to grow in size...
In this paper, we concentrate on techniques for tolerating failures in these environments. In this cont...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. I...
In this paper, we present a new protocol for optimistic rollback recovery in distributed systems. Th...
As human dependence on computing technology increases, so does the need for computer system dependab...
Fault-tolerance protocols play an important role in today's long-runtime scienti...
High performance computing applications must be tolerant to faults, which are common occurrences esp...