Applicative systems are promising candidates for achieving high-performance computing through the aggregation of processors. This paper studies fault-recovery problems in a class of applicative systems. The concept of functional checkpointing is proposed as the nucleus of a distributed recovery mechanism. This entails incrementally building a resilient structure as the evaluation of an applicative program proceeds. A simple rollback algorithm is suggested to regenerate the corrupted structure by redoing the most effective functional checkpoints. Another algorithm, which attempts to recover intermediate results, is also presented. The parent of a faulty task reproduces a functional twin of the failed task. The regenerated task inherits all o...
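The parent-side regeneration described in the abstract above can be illustrated with a small sketch. This is not the paper's algorithm: it only assumes that a parent retains, for each child, the function and arguments needed to re-create it, and that re-evaluation is safe because applicative tasks have no side effects. The names `Checkpoint`, `TaskFailure`, `run_task`, `spawn_resilient`, and the fault-injection table `faults` are all hypothetical, and the sketch does not attempt the recovery of intermediate results mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


class TaskFailure(Exception):
    """Hypothetical signal that the processor running a task has failed."""


@dataclass(frozen=True)
class Checkpoint:
    """What a parent retains about a child: its function and arguments."""
    fn: Callable
    args: Tuple


def run_task(cp: Checkpoint, faults: Dict[Tuple, int], log: List[Tuple]):
    """Run one task; fail it if a fault is still scheduled for its arguments."""
    if faults.get(cp.args, 0) > 0:
        faults[cp.args] -= 1          # consume one injected fault
        raise TaskFailure(cp.args)
    return cp.fn(*cp.args, faults, log)


def spawn_resilient(cp: Checkpoint, faults: Dict[Tuple, int], log: List[Tuple]):
    """Parent side: evaluate a child, regenerating a twin whenever it fails."""
    while True:
        try:
            return run_task(cp, faults, log)
        except TaskFailure:
            # The retained checkpoint is enough to re-create the child,
            # because a pure function of the same arguments gives the same value.
            log.append(cp.args)


def fib(n: int, faults: Dict[Tuple, int], log: List[Tuple]) -> int:
    """Example applicative program: every recursive call is a child task."""
    if n < 2:
        return n
    left = spawn_resilient(Checkpoint(fib, (n - 1,)), faults, log)
    right = spawn_resilient(Checkpoint(fib, (n - 2,)), faults, log)
    return left + right
```

Calling `fib(5, {(3,): 1}, log)` injects a single failure into the first task evaluating `fib(3)`; its parent catches the failure, records it in `log`, and re-runs the twin from the retained checkpoint, so the overall result is unaffected. Each parent handles only its own children, which keeps recovery local rather than requiring a global rollback.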
In this work we have addressed the complex problem of recovery for concurrent failures in a distribu...
In the crash-recovery failure model of asynchronous distributed systems, processes can temporarily s...
A novel approach to application fault recovery based on autonomic computing works by accurately moni...
Technical report. Applicative systems are promising candidates to achieve high performance computing t...
A general framework for the design and analysis of distributed fault-tolerant systems is proposed in...
We have addressed the complex problem of recovery for concurrent failures in distributed computing e...
For implementing fault-tolerance in multicomputer systems, backward error recovery, based on checkpo...
This paper describes a checkpoint comparison and optimistic execution technique for error detection ...
We consider the problem of bringing a distributed system to a consistent state after transient fail...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
The reliability of concurrent and distributed systems often depends on some well-known techniques fo...
Traditional reliability-related models for fault-tolerant systems are used to predict system reliabi...
This thesis studies a forward recovery strategy using checkpointing and optimistic execution in para...
Various aspects of reliable computing are formalized and quantified with emphasis on efficient fault...