We propose a new approach for developing and deploying distributed systems, in which nodes predict distributed consequences of their actions, and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violation. This article describes the design and implementation of this approach, termed CrystalBall. We evaluate CrystalBall on RandTree, BulletPrime, Paxos, and Chord distributed system implementations. We identified new bugs in ...
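The idea of exploring forward from a snapshot to predict safety violations can be illustrated with a minimal sketch. This is not the consequence prediction algorithm itself, which follows only causally related chains of events. It is a plain depth-bounded breadth-first search over a toy model. The names `predict_violation`, `toy_successors`, and `toy_is_safe`, and the "at most two children" property, are assumptions made for illustration and do not come from CrystalBall.

```python
from collections import deque


def predict_violation(snapshot, successors, is_safe, max_depth):
    """Depth-bounded breadth-first search from `snapshot`.

    Returns the list of event labels leading to the first unsafe state
    found, or None if no violation is reachable within `max_depth` events.
    """
    if not is_safe(snapshot):
        return []  # the snapshot itself already violates the property
    seen = {snapshot}
    frontier = deque([(snapshot, [])])
    while frontier:
        state, path = frontier.popleft()
        if len(path) >= max_depth:
            continue  # do not explore beyond the depth bound
        for event, nxt in successors(state):
            if nxt in seen:
                continue  # skip states that were already explored
            seen.add(nxt)
            new_path = path + [event]
            if not is_safe(nxt):
                return new_path
            frontier.append((nxt, new_path))
    return None


# Hypothetical toy model: state = (current children, pending join requests).
def toy_successors(state):
    children, pending = state
    out = []
    if pending > 0:
        out.append(("deliver_join", (children + 1, pending - 1)))
    if children > 0:
        out.append(("child_leaves", (children - 1, pending)))
    return out


# Hypothetical safety property: a node never has more than two children.
def toy_is_safe(state):
    return state[0] <= 2


trace = predict_violation((1, 2), toy_successors, toy_is_safe, max_depth=3)
print(trace)
```

In this toy run the search reports that delivering both pending joins would exceed the child limit. A node holding such a predicted trace could, in principle, steer away from the first event in it. How CrystalBall actually does this is described in the article, not in this sketch.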
As cloud computing becomes increasingly popular, there is a growing need for replicated distributed ...
In distributed systems, if a hardware fault corrupts the state of a process, this error might propag...
It is notoriously hard to develop dependable distributed systems. This is partly due to the difficul...
We propose a new approach for developing and deploying distributed systems, in which nodes predict d...
We propose a new approach for developing and deploying distributed systems, in which nodes predict d...
We consider the problem of predicting faults in deployed, large-scale distributed systems that are h...
Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose fai...
Robust distributed systems commonly employ high-level recovery mechanisms enabling the system to re...
Distributed systems are ubiquitous but continue to be challenging to understand, build, and troubles...
It is notoriously difficult to develop reliable, high-performance distributed systems that run over ...
An extension of the Chandy-Lamport algorithm ([Chan84]) to find global states of distributed system...
This paper presents an algorithm by which a process in a distributed system determines a global stat...
This paper presents an algorithm by which a process in a distributed system determines a global stat...
The distributed systems research community has developed many provably correct algorithms and abstra...
In debugging distributed programs a distinction is made between an observed error and the program fa...