Rollback techniques based on message logging and deterministic replay allow a parallel system to recover a failed node without involving other nodes. Distributed shared memory (DSM) systems cannot apply message-passing logging techniques directly because they rely on inherently nondeterministic asynchronous communication. This paper presents new logging schemes that reduce the typically high overhead of logging in DSM. Our algorithm for sequentially consistent systems tracks accesses to shared memory rather than logging them. In an extension of this method to lazy release consistency, the per-access overhead of tracking is eliminated completely. Measurements with parallel applications show a significant reduction in failure-free overhead...
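The difference between tracking and logging accesses can be illustrated with a small sketch. It is not the paper's algorithm; it only assumes that a node's execution is deterministic between asynchronous events, so it is enough to record when each event arrived (as a count of shared-memory accesses) rather than the value returned by every access. All names here (`TrackedNode`, `deliver_update`, `run_original`, `replay`) are hypothetical, and the fixed `arrivals` table stands in for real asynchronous timing.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedNode:
    """Hypothetical node that counts shared-memory accesses instead of logging them."""
    memory: dict = field(default_factory=dict)
    access_count: int = 0
    # One entry per asynchronous event: (access_count at delivery, address, value).
    event_log: list = field(default_factory=list)

    def read(self, addr):
        # Tracking: only a counter is incremented; the value read is not logged.
        self.access_count += 1
        return self.memory.get(addr, 0)

    def deliver_update(self, addr, value, record=True):
        # An asynchronous remote update; its position in the access stream is recorded.
        if record:
            self.event_log.append((self.access_count, addr, value))
        self.memory[addr] = value


def program(node, steps, before_access):
    """Deterministic computation: sums the shared address 'x' over several reads."""
    total = 0
    for _ in range(steps):
        before_access(node)  # point at which an asynchronous event may be delivered
        total += node.read("x")
    return total


def run_original(steps=6):
    """Failure-free run; the arrivals table is an assumption standing in for real timing."""
    node = TrackedNode()
    arrivals = {2: ("x", 5), 4: ("x", 7)}

    def hook(n):
        update = arrivals.get(n.access_count)
        if update is not None:
            n.deliver_update(*update)

    return program(node, steps, hook), node.event_log


def replay(event_log, steps=6):
    """Recovery: re-execute and deliver each event at its recorded access count."""
    node = TrackedNode()
    pending = list(event_log)

    def hook(n):
        while pending and pending[0][0] == n.access_count:
            _, addr, value = pending.pop(0)
            n.deliver_update(addr, value, record=False)

    return program(node, steps, hook)


if __name__ == "__main__":
    result, log = run_original()
    print(result, log, replay(log) == result)
```

In this sketch the log grows with the number of asynchronous events rather than with the number of accesses, which is the intuition behind replacing per-access logging with tracking.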
Distributed Shared Memory (DSM) systems combine the ease of programming of Shared Memory Parallel Co...
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be eff...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
This paper investigates the problem of rollback recovery in distributed shared memory (DSM) systems....
The distributed shared memory (DSM) system transforms an existing network of workstations to a powe...
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attracti...
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensur...
Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require un...
This paper presents an efficient, writer-based logging scheme for recoverable distributed shared mem...
A common approach to fault-tolerant software DSM is to take checkpoints with message loggi...
This paper presents a fault tolerance algorithm for a home-based lazy release consistency distribute...
Relaxed memory consistency models tolerate increased memory access latency in both hardware and soft...
We present a peer logging system for reducing performance overhead in fault-tolerant distributed sha...
Large-scale distributed systems are very attractive for the execution of parallel applications requi...