Abstract. A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead because each node saves its coherence-related data into the memory of a remote node over a high-speed system area network. To make fault-tolerant DSM more lightweight, this paper focuses on eliminating shared-memory checkpointing during failure-free execution. Each node independently checkpoints only its execution state and non-shared data. When a node fails, it regenerates its pages from the remote copies held by live nodes. To reconstruct pages efficiently, we also introduce an XOR-diffing technique. The diff logs, which have been created by XOR operations during failure-fre...
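The idea behind XOR-based diffs can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`xor_diff`, `apply_diff`, `rebuild_page`) and the byte-string representation of a page are assumptions made for illustration. It relies only on the fact that XOR is its own inverse, so a diff computed as `old XOR new` turns `old` back into `new` when applied again.

```python
from typing import Iterable


def xor_diff(old: bytes, new: bytes) -> bytes:
    """Return the byte-wise XOR of two equally sized page images."""
    if len(old) != len(new):
        raise ValueError("pages must have the same length")
    return bytes(a ^ b for a, b in zip(old, new))


def apply_diff(page: bytes, diff: bytes) -> bytes:
    """Apply an XOR diff to a page; XOR is its own inverse."""
    return xor_diff(page, diff)


def rebuild_page(base: bytes, diffs: Iterable[bytes]) -> bytes:
    """Replay a sequence of logged XOR diffs on top of a base copy (hypothetical helper)."""
    page = base
    for diff in diffs:
        page = apply_diff(page, diff)
    return page


if __name__ == "__main__":
    v0 = bytes([0, 1, 2, 3])
    v1 = bytes([0, 9, 2, 7])
    log = [xor_diff(v0, v1)]
    print(rebuild_page(v0, log) == v1)
```

Unmodified bytes produce zeros in the diff, so a log of this kind compresses well when only a small part of a page changes.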
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to di...
In order to deploy a tightly-coupled multiprocessor (TCMP) in the commercial world, the TCMP must be...
Backward error recovery involving checkpointing and restart of tasks is an important component of an...
This paper investigates the problem of rollback recovery in distributed shared memory (DSM) systems....
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attracti...
Rollback techniques that use message logging and deterministic replay can be used in parallel system...
The distributed shared memory (DSM) system transforms an existing network of workstations to a powe...
Large-scale distributed systems are very attractive for the execution of parallel applications requi...
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensur...
Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require un...
This paper presents a fault tolerance algorithm for a home-based lazy release consistency distribute...
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be eff...
Distributed Shared Memory (DSM) systems combine the ease of programming of Shared Memory Parallel Co...
Distributed Shared Memory (DSM) systems combine the ease of programming of shared memory parallel co...