This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a home-based lazy release consistency (HLRC) DSM system with independent checkpointing and logging to volatile memory, targeting shared-memory computing on very large LAN-based clusters. In these environments, where global coordination may be expensive, independent checkpointing becomes critical to scalability. However, independent checkpointing is only practical if we can control the size of the log and checkpoints in the absence of global coordination. In this paper we describe the design of our fault-tolerant DSM system and present our solutions to the problems of chec...
desirable features: A process can independently initiate consistent global checkpointing by saving i...
This thesis focuses on the issue of reliability and fault tolerance in Distributed Shared Memory Mul...
In order to deploy a tightly-coupled multiprocessor (TCMP) in the commercial world, the TCMP must be...
This paper presents a fault tolerance algorithm for a home-based lazy release consistency distribute...
This research proposes an algorithm for fault-tolerance in a home-based lazy release consistent dist...
We present a peer logging system for reducing performance overhead in fault-tolerant distributed sha...
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to di...
Distributed Shared Memory (DSM) architectures are attractive to execute high performance parallel ...
Large-scale distributed systems are very attractive for the execution of parallel applications requi...
Rollback techniques that use message logging and deterministic replay can be used in parallel system...
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attracti...
This paper investigates the problem of rollback recovery in distributed shared memory (DSM) systems....
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
Backward error recovery involving checkpointing and restart of tasks is an important component of an...