In this paper, we devise a new method for transparent fault tolerance of distributed programs running on a cluster of networked workstations. We use the concept of alternative schedules for this purpose. Such schedules are generated from static task graphs at compile-time. At run-time, a distributed program can use these alternatives to switch from one schedule to another if one or more machines become faulty. We have devised fast and efficient mechanisms for switching among schedules at run-time. This enables recovery from any number of simultaneous machine faults, any number of times. The correctness of the resultant algorithm is ensured by preventing direct data sharing among local tasks on a machine. Such a transparent fault toler...
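The idea of precomputed alternative schedules can be illustrated with a minimal sketch. It is not the authors' algorithm: the round-robin placement, the function names (`build_schedules`, `run`), and the failure model (a machine fails just before a given task starts) are all assumptions made for illustration, and re-execution of results lost on a failed machine is deliberately left out.

```python
from itertools import combinations


def topological_order(deps):
    """Return tasks in an order that respects dependencies.

    deps maps each task to the set of tasks it depends on (a static task graph).
    """
    order, done = [], set()

    def visit(task):
        if task in done:
            return
        for prerequisite in sorted(deps[task]):
            visit(prerequisite)
        done.add(task)
        order.append(task)

    for task in sorted(deps):
        visit(task)
    return order


def build_schedules(deps, machines):
    """'Compile-time' step (hypothetical): one schedule per surviving machine set.

    Each schedule maps task -> machine using a simple round-robin placement,
    which stands in for whatever scheduler a real system would use.
    """
    order = topological_order(deps)
    schedules = {}
    for k in range(1, len(machines) + 1):
        for alive in combinations(sorted(machines), k):
            schedules[frozenset(alive)] = {
                task: alive[i % k] for i, task in enumerate(order)
            }
    return schedules


def run(deps, machines, schedules, failures):
    """'Run-time' step (hypothetical): switch schedules when a machine fails.

    failures maps a task to the machine assumed to fail just before that task starts.
    Returns the list of (task, machine) placements actually used.
    """
    alive = set(machines)
    log = []
    for task in topological_order(deps):
        if task in failures:
            alive.discard(failures[task])
        if not alive:
            raise RuntimeError("no machines left to run " + task)
        current = schedules[frozenset(alive)]  # constant-time schedule switch
        log.append((task, current[task]))
    return log


deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
machines = ["m1", "m2"]
schedules = build_schedules(deps, machines)
placements = run(deps, machines, schedules, failures={"c": "m1"})
```

Because every schedule is computed before execution, the switch at run-time reduces to a dictionary lookup keyed by the set of surviving machines. The number of precomputed schedules grows exponentially with the number of machines in this naive form, so a practical system would need a more selective way of generating alternatives.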
We present a new software architecture in which all concepts necessary to achieve fault tolerance ca...
Clusters of message-passing computing nodes provide high-performance platforms for distributed appli...
Distributed collaboration systems must allow dynamic joining and leaving of sessions and therefore m...
A Cluster of Workstations (COW) is a network-based multi-computer system aimed to replace supercompute...
This paper presents the performance evaluation of a software fault manager for distributed applicati...
This paper describes a new method for providing transparent fault tolerance for parallel application...
In this article, we propose a strategy for the synthesis of fault-tolerant schedules and for the map...
This paper introduces a network fault model for distributed applications developed with the Mozart p...
Our goal is to automatically obtain a distributed and fault-tolerant embedded system: distributed be...
In distributed systems, a real-time task has several subtasks which need to be executed at different...
In this paper, we propose an efficient scheduling algorithm for problems in which tasks with precede...
This paper addresses the question as to whether there is potential gain to be made from executing su...
This paper shows that asynchronous fault detection is a practical way to reflect partial failure in ...
Because fault failures tend to affect whole areas, in some cases, and not on...
The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can...