Dependability is becoming a requirement in an increasing number of domains, including those that were previously thought to be noncritical. Examples include large distributed systems deployed in domains such as e-commerce, information mining, messaging, and entertainment. Such systems provide a challenge to existing fault tolerance approaches because of their requirements for low-cost solutions that can be adapted to work with off-the-shelf components. At the same time, their scale makes it difficult to accurately diagnose faults and recover from them. This dissertation proposes a model-based solution to building a theoretically well-founded recovery framework based on partially observable Markov decision processes that is inexpensive to...
Large-scale decentralized systems of autonomous agents interacting via asynchronous communication of...
Traditionally, fault-tolerant systems assume that failures are independent, often expressed as a thr...
Recent years have seen a growth in research on system reliability and maintenance. Various studies i...
186 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2007. We are unaware of any other f...
Automatic system monitoring and recovery has the potential to provide a low-cost solution for high a...
Distributed applications executing in uncertain environments, like the Internet, need to make timing...
Fault-tolerant distributed systems often handle failures in two steps: first, detect the failure...
This book covers the most essential techniques for designing and building dependable distributed sys...
This dissertation presents techniques for detecting and tolerating faults in distributed systems...
For the last 40 years, the systems community has invested a lot of effort in designing technique...
Distributed programs are hard to get right because they are required to be open, scalable, long-runn...
It is of great importance to operate a computer system with high reliability. Several techniques to ...
A general framework for the design and analysis of distributed fault-tolerant systems is proposed in...