The distributed systems research community has developed many provably correct algorithms and abstractions that are in wide use. However, practical implementations of distributed systems often contain many bugs, and practitioners spend much of their time troubleshooting these bugs. In this paper we present an algorithm, retrospective causal inference, to ease the burden of troubleshooting. We end by enumerating several open research problems related to the troubleshooting process.
A general framework for the design and analysis of distributed fault-tolerant systems is proposed in...
This book covers the most essential techniques for designing and building dependable distributed sys...
In this paper, a methodology for distributed fault diagnosis is proposed. The algorithm p...
Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose fai...
Distributed software systems have become the backbone of Internet services. Failures in production ...
Abstract: Formal methods for deciding the properties of service oriented systems are of paramount im...
Diagnosing and repairing problems in complex distributed systems has always been challenging. A wide...
Abstract: The paper describes a novel framework for using causal models in distributed fault diagnos...
This document describes the research performed on fault isolation in dynamic distributed systems at ...
We propose a new approach for developing and deploying distributed systems, in which nodes predict d...
We consider issues of fault tolerance for distributed computing systems at two levels of system desi...
Abstract. In a distributed environment, several components collaborate with each other to cater a c...
When confronted with a buggy execution of a distributed system, which are commonplace for distributed ...
This paper describes a method for automated analysis of fault-tolerance properties of distributed sy...
A causal distributed breakpoint is initiated by a sequential breakpoint in one process of a distribu...