Backward error recovery involving checkpointing and restart of tasks is an important component of any system providing fault tolerance to applications distributed over a network. A central problem in checkpointing and recovery is the ability to track dependencies and arrive at a consistent global checkpoint. Traditionally, the literature treats either distributed shared memory (DSM) or message passing as the interprocess communication (IPC) mechanism when considering the issue of fault tolerance. This paper describes a preliminary investigation into common mechanisms that can be implemented to support a wide variety of protocols in both shared memory and message passing systems. In effect, it can be used in a system that combines both these IPC ...
In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable d...
We present a peer logging system for reducing performance overhead in fault-tolerant distributed sha...
For implementing fault-tolerance in multicomputer systems, backward error recovery, based on checkpo...
This thesis focuses on the issue of reliability and fault tolerance in Distributed Shared Memory Mul...
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to di...
Distributed Shared Memory (DSM) architectures are attractive to execute high performance parallel ...
Distributed Shared Memory (DSM) systems combine the ease of programming of Shared Memory Parallel Co...
Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require un...
Distributed Shared Memory (DSM) systems combine the ease of programming of shared memory parallel co...
In order to deploy a tightly-coupled multiprocessor (TCMP) in the commercial world, the TCMP must be...
The concept of backward recovery is now well established as a means of restoring a consistent state ...
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensur...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...