This paper presents a performance evaluation of STAR, a software fault manager for distributed applications. STAR exploits the natural redundancy of networks of workstations to offer a high level of fault tolerance, and its fault management is transparent to the supported parallel applications. It is application independent, highly configurable, and easily portable to UNIX-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can be implemented efficiently in a standard networked environment.

1. Introduction

Few distributed computing environ...
Large scale distributed computing systems have been extensively utilized to host critical applicatio...
In this paper, we devise a new method for transparent fault tolerance of distributed programs runnin...
This thesis addresses issues in building fault-tolerant distributed real-time systems. Such systems ...
Fault tolerance can be defined as a concept of recovery that keeps a computer system operational by ...
Clusters of message-passing computing nodes provide high-performance platforms for distributed appli...
We present a new software architecture in which all concepts necessary to achieve fault tolerance ca...
This paper introduces a network fault model for distributed applications developed with the Mozart p...
A general framework for the design and analysis of distributed fault-tolerant systems is proposed in...
In massively parallel systems (MPS), fault tolerance is indispensable to obtain proper completion o...
This paper addresses the question as to whether there is potential gain to be made from executing su...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
Today’s software engineering and application development trend is to take advantage of reusable soft...
Distributed computing infrastructures support system and network fault-toleran...
The development of scientific software, reliable and efficient, in distributed computing environments, r...