Large-scale distributed systems are very attractive for executing parallel applications that require huge computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism that relies on a recoverable distributed shared memory (DSM). Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach limits hardware development and takes advantage of the data replicatio...
This paper investigates the problem of rollback recovery in distributed shared memory (DSM) systems....
This paper describes issues in the design and implementation of checkpointing and recovery modules f...
Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require un...
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensur...
The distributed shared memory (DSM) system transforms an existing network of workstations to a powe...
Distributed Shared Memory (DSM) systems combine the ease of programming of Shared Memory Parallel Co...
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to di...
Distributed Shared Memory (DSM) systems combine the ease of programming of shared memory parallel co...
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attracti...