This paper describes libra, a library that supports efficient, reliable distributed applications. Libra is designed to meet two objectives: to simplify the development of reliable distributed applications, and to achieve fault tolerance at low run-time cost. The first objective is met by providing fault-tolerance transparency and a simple, easy-to-use, high-level message-passing interface. Libra provides fault tolerance to applications transparently, based on distributed consistent checkpointing and rollback-recovery integrated with a user-level network communication protocol. The second objective is met by using protocols that minimise the communication overhead of taking a consistent distributed checkpoint and catching me...
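To make the idea of consistent checkpointing with rollback-recovery concrete, the following is a minimal, single-machine sketch of a coordinated (two-phase) checkpoint. It is not libra's protocol or API, neither of which is shown in this abstract. The names `Process`, `coordinated_checkpoint`, and `recover` are hypothetical, and the sketch assumes that no messages are in flight when the checkpoint is taken, which a real protocol would have to handle.

```python
import copy


class Process:
    """Toy process (hypothetical): state is a counter plus a log of messages."""

    def __init__(self, pid):
        self.pid = pid
        self.state = {"counter": 0, "received": []}
        self.tentative = None  # checkpoint taken but not yet committed
        self.committed = copy.deepcopy(self.state)  # last consistent checkpoint

    def deliver(self, msg):
        # Receiving a message changes the process state.
        self.state["counter"] += 1
        self.state["received"].append(msg)

    def take_tentative(self):
        # Phase 1: save a private copy of the current state.
        self.tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        # Phase 2: the tentative checkpoint becomes the recovery point.
        self.committed, self.tentative = self.tentative, None

    def rollback(self):
        # Recovery: restore the last committed checkpoint.
        self.state = copy.deepcopy(self.committed)


def coordinated_checkpoint(processes):
    """Commit a checkpoint only if every process took a tentative one.

    Assumption: no messages are in transit while this runs.
    """
    if all(p.take_tentative() for p in processes):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.tentative = None  # abort: discard partial tentative checkpoints
    return False


def recover(processes):
    """Roll every process back to the same committed checkpoint."""
    for p in processes:
        p.rollback()


# Usage: one delivery before the checkpoint, one after, then a rollback.
procs = [Process(i) for i in range(3)]
procs[1].deliver("a")
coordinated_checkpoint(procs)
procs[2].deliver("b")  # this work is lost when the system rolls back
recover(procs)
```

After `recover`, every process holds the state it had when the checkpoint was committed, so the delivery of `"a"` survives and the delivery of `"b"` is undone. Committing only when all processes have saved a tentative checkpoint is what keeps the set of saved states mutually consistent in this simplified setting.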
Today’s software engineering and application development trend is to take advantage of reusable soft...
The increasing failure rate in High Performance Computing encourages the investigation of fault tole...
Today, many distributed systems are deployed in high-performance computing environments such as a mu...
This book covers the most essential techniques for designing and building dependable distributed sys...
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to di...
IONS FOR RELIABLE DISTRIBUTED COMPUTING Reliable distributed systems are challenging to build becau...
Clusters of message-passing computing nodes provide high-performance platforms for distributed appli...
This book presents the most important fault-tolerant distributed programming a...
This paper presents the performance evaluation of a software fault manager for distributed applicati...
As human dependence on computing technology increases, so does the need for computer system dependab...
We present a new approach for building fault-tolerant distributed systems based on distributed trans...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
In recent years, the study of distributed systems has become an increasingly important focus of comp...