This paper presents a fault tolerance algorithm for a home-based lazy release consistency distributed shared memory (DSM) system based on volatile logging and independent checkpointing. The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers as well as collaborative shared-memory applications on wide-area meta-clusters over the Internet. The challenge in building such systems lies in controlling the size of the logs and garbage-collecting unnecessary checkpoints in the absence of global coordination. In this paper we define a set of rules for lazy log trimming (LLT) and checkpoint garbage collection (CGC) and prove that they do not affect the recoverability of the system. We have...
Relaxed memory consistency models tolerate increased memory access latency in both hardware and soft...
In order to deploy a tightly-coupled multiprocessor (TCMP) in the commercial world, the TCMP must be...
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensur...
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be eff...
We present a peer logging system for reducing performance overhead in fault-tolerant distributed sha...
Rollback techniques that use message logging and deterministic replay can be used in parallel system...
This paper presents an efficient, writer-based logging scheme for recoverable distributed shared mem...
This paper investigates the problem of rollback recovery in distributed shared memory (DSM) systems....
The distributed shared memory (DSM) system transforms an existing network of workstations to a powe...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to di...
A common approach to fault-tolerant software DSM is to take checkpoints with message loggi...
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attracti...
Large-scale distributed systems are very attractive for the execution of parallel applications requi...