This paper describes a new method for providing transparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of a shared object system called SAM, a portable run-time system that provides a global name space and automatic caching of shared data. SAM incorporates a novel design intended to address the problem of high communication overheads in distributed memory environments and is implemented on a variety of distributed memory platforms. Our fundamental approach to providing fault tolerance is to ensure the replication of all data on more than one workstation using the dynamic caching already provided by SAM. The replicated data is accessible to the local processor like oth...
We present a new approach for building fault-tolerant distributed systems based on distributed trans...
Most methods for programming loosely-coupled systems are based on message-passing. Recently, however...
Distributed data services use redundancy to ensure data availability and survivability. Replication ...
Distributed Shared Memory (DSM) architectures are attractive to execute high performance parallel ...
This thesis focuses on the issue of reliability and fault tolerance in Distributed Shared Memory Mul...
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to di...
A network multicomputer is a multiprocessor in which the processors are connected by general-purpose...
The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can...
This research proposes an algorithm for fault-tolerance in a home-based lazy release consistent dist...
Distributed Shared Memory (DSM) architectures are attractive to execute high p...
In this paper, we devise a new method for transparent fault tolerance of distributed programs runnin...
The pervasiveness of cloud-based services has significantly increased the demand for highly dependab...
This paper shows how to define consistency conditions for distributed shared memories in virtually s...
We present a new software architecture in which all concepts necessary to achieve fault tolerance ca...
Current and future multicore architectures can significantly accelerate the performance of test auto...