As today's distributed applications increase in complexity, it becomes increasingly difficult to detect errors and performance anomalies in these applications. In addition, some faults only manifest when the application is deployed at large scale. Most of the existing debugging tools scale poorly and do not automate the process of finding the origin of failures. Although it is desirable to automatically predict impending failures, most of the existing error detection approaches do not predict failures. This dissertation proposes scalable techniques for error detection, problem localization, and failure prediction for distributed applications. First, an error detection and diagnosis technique for scientific applications is presented. The...
Large, production quality distributed systems still fail periodically, and do so sometimes catastro...
A general framework for the design and analysis of distributed fault-tolerant systems is proposed in...
When failures occur during software testing, automated software fault localization helps to diagnose...
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed an...
Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose fai...
Distributed software systems have become the backbone of Internet services. Failures in production ...
Developing correct and efficient software for large scale systems is a challenging task. Developers ...
We propose a new fault localization technique for software bugs in large-scale computing systems. Ou...
Today's largest systems have over 100,000 cores, with million-core systems expected over the next fe...
We consider the problem of predicting faults in deployed, large-scale distributed systems that are h...
For dependability outages in distributed internet infrastructures, it is often not enough ...
required to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challeng...
Distributed systems form an integral part of human life—from ATMs to the Domain Name Service. Typica...
For constructing fault tolerance mechanisms in large massively parallel multiprocessor systems, a s...
Resiliency of exascale systems has quickly become an important concern for the scientific community....