Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose failures early to improve system reliability. This dissertation introduces new approaches to root-cause diagnosis for two notorious types of failures in distributed systems. It first focuses on failures caused by software bugs that are triggered by race conditions. Because they manifest non-deterministically, these bugs are much harder to diagnose, fix, and test than bugs in sequential logic. To understand concurrency bugs, we first study their characteristics using 105 bugs from four representative open-source programs. Motivated by the interesting findings from the study, we also pro...
required to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challeng...
During the past few years distributed systems have been the focus of considerable research in comput...
The ever-increasing parallelism in computer systems has made software more prone to concurrency fail...
Distributed software systems have become the backbone of Internet services. Failures in production ...
Diagnosing and repairing problems in complex distributed systems has always been challenging. A wide...
As today's distributed applications increase in complexity, it becomes increasingly difficult to ...
We consider issues of fault tolerance for distributed computing systems at two levels of system desi...
The distributed systems research community has developed many provably correct algorithms and abstra...
Concurrency faults are one of the most damaging types of faults that can affect the dependability of...
A general framework for the design and analysis of distributed fault-tolerant systems is proposed in...
The occurrence of faults is a common feature in most networks and addressing this issue is an import...
Large, production-quality distributed systems still fail periodically, and do so sometimes catastro...
In today's world where distributed systems form many of our critical infrastructures, dependabili...
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed an...
Software that performs well in one environment may be unusably slow in another, and determining the ...