Diagnosing and repairing problems in complex distributed systems has always been challenging. A wide variety of problems can occur: routers can be misconfigured, nodes can be hacked, and the control software can have bugs. The complexity and scale of today’s distributed systems make diagnosis harder still. Provenance is an attractive way to diagnose faults in distributed systems because it can track the causality from a symptom to a set of root causes. Prior work on network provenance has successfully applied provenance to distributed systems. However, it cannot explain problems beyond the presence of faulty events, and it offers limited help with finding repairs. In this dissertation, we extend provenance to handle...
Running experiments on modern systems like supercomputers, cloud infrastructu...
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer S...
This dissertation presents techniques for detecting and tolerating faults in distributed systems...
In large-scale networks, many things can go wrong: routers can be misconfigured, programs can be bug...
In this paper, we propose a new approach to diagnosing problems in complex networks. Our approach i...
Operators of distributed systems often find themselves needing to answer forensic questions, to perf...
In this paper, we explore the use of provenance for analyzing execution dynamics in distributed syst...
When debugging a distributed system, it is sometimes necessary to explain the absence of an event – ...
Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose fai...
We demonstrate NetTrails, a declarative platform for maintaining and interactively querying network ...
Today’s distributed systems are becoming increasingly complex, due to the ever-growing number of net...
Network accountability, forensic analysis, and failure diagnosis are becoming increasingly important...
The ability to reason about changes in a distributed system’s state enables network administrators t...