We present a peer logging system for reducing performance overhead in fault-tolerant distributed shared memory (DSM) systems. Our system provides fault-tolerant shared memory using individual checkpointing and rollback. Peer logging logs DSM modification messages to remote nodes instead of to local disks. We present results for implementations of our fault-tolerant technique using simulations of both TreadMarks, a software-only DSM, and Cashmere, a DSM using memory-mapped hardware. We compare simulations with no fault tolerance to simulations with local disk logging and peer logging. We present results showing that fault-tolerant TreadMarks can be achieved with an average of 17% overhead for peer logging. We also present results showing that whil...
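The idea of peer logging combined with individual checkpointing and rollback can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, its fields, and the single-peer pairing are hypothetical, the "remote node" is just another in-process object, and real DSM protocols such as TreadMarks or Cashmere exchange far richer messages than the address/value pairs assumed here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical DSM node: local memory, a checkpoint, and a log kept for a peer."""
    node_id: int
    memory: Dict[int, int] = field(default_factory=dict)      # address -> value
    checkpoint: Dict[int, int] = field(default_factory=dict)  # last saved local state
    peer_log: List[Tuple[int, int]] = field(default_factory=list)  # entries logged on behalf of another node

    def apply_modification(self, addr: int, value: int, peer: "Node") -> None:
        # Log the modification message in the peer's memory instead of on a local disk.
        peer.peer_log.append((addr, value))
        self.memory[addr] = value

    def take_checkpoint(self, peer: "Node") -> None:
        # Individual checkpoint: save local state, after which the peer's older log entries are no longer needed.
        self.checkpoint = dict(self.memory)
        peer.peer_log.clear()

    def recover(self, peer: "Node") -> None:
        # Roll back to the last checkpoint, then replay the modifications the peer logged since then.
        self.memory = dict(self.checkpoint)
        for addr, value in peer.peer_log:
            self.memory[addr] = value


def demo() -> Dict[int, int]:
    a, b = Node(0), Node(1)
    a.apply_modification(0x10, 1, peer=b)
    a.take_checkpoint(peer=b)
    a.apply_modification(0x10, 2, peer=b)
    a.apply_modification(0x20, 3, peer=b)
    a.memory.clear()          # simulate node a losing its volatile state
    a.recover(peer=b)
    return a.memory
```

In this sketch the log survives the failure of node `a` only because it lives on node `b`; a simultaneous failure of both nodes is outside what the sketch handles.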
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
We present a new approach for building fault-tolerant distributed systems based on distributed trans...
Large-scale distributed systems are very attractive for the execution of parallel applications requi...
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be eff...
This paper presents a fault tolerance algorithm for a home-based lazy release consistency distribute...
Rollback techniques that use message logging and deterministic replay can be used in parallel system...
The distributed shared memory (DSM) system transforms an existing network of workstations to a powe...
This thesis focuses on the issue of reliability and fault tolerance in Distributed Shared Memory Mul...
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensur...
This paper investigates the problem of rollback recovery in distributed shared memory (DSM) systems....
Backward error recovery involving checkpointing and restart of tasks is an important component of an...
In order to deploy a tightly-coupled multiprocessor (TCMP) in the commercial world, the TCMP must be...
Distributed Shared Memory (DSM) architectures are attractive to execute high performance parallel ...
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to di...