Distributed systems are ubiquitous but continue to be challenging to understand, build, and troubleshoot. Fundamentally, reasoning about distributed system behaviors is hard due to the effects of partial failures and nondeterminism in system executions. For example, we expect systems to remain available even if some number of replicas fail. These problems are exacerbated by the dynamic nature and scale of production systems today. Tooling support has lagged behind the pace at which systems are being deployed, and more research in this space is urgently needed. Our overarching claim is that many common distributed systems problems, such as improving fault tolerance or debugging failures, can be addressed by querying observations of executions. Sinc...
Diagnosing and correcting failures in complex, distributed systems is difficult. In a network of per...
With the increasing presence, scale, and complexity of distributed systems, resource failures are be...
Abstract. We present a three-part approach for diagnosing bugs and performance problems in productio...
When confronted with a buggy execution of a distributed system, which are commonplace for distributed ...
Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose fai...
Thesis (Ph.D.)--University of Washington, 2019. Designing and debugging distributed systems is notorio...
We propose a new approach for developing and deploying distributed systems, in which nodes predict d...
Distributed software systems have become the backbone of Internet services. Failures in production ...
One of the most challenging problems facing today's software engineer is to understand and modify di...
Diagnosing and repairing problems in complex distributed systems has always been challenging. A wide...
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed an...
Modern software projects are incredible feats of engineering that manage dozens of concurrent execut...
Concurrency faults are one of the most damaging types of faults that can affect the dependability of...