When a performance crisis occurs in a datacenter, rapid recovery requires quickly recognizing whether a similar incident occurred before, in which case a known remedy may apply, or whether the problem is new, in which case new troubleshooting is necessary. To address this issue we propose a new and efficient representation of the datacenter’s state, a fingerprint, that scales linearly with the number of performance metrics considered and is not affected by the number of machines. These fingerprints are generated online and then used as unique identifiers of the different types of performance crises so that we can effectively recognize previous occurrences and retrieve repair actions. We evaluate our approach on a production datacenter ...
In the current network-based computing world, where the number of interconnected devices grows expon...
Thesis (Ph.D.)--University of Washington, 2018. Fast and accurate failure diagnosis remains a major ch...
In this paper, we present CLUE, a system event analytics tool for black-box performance dia...
Contemporary datacenters comprise hundreds or thousands of machines running applications requiring h...
Large production systems are susceptible to chronic performance problems where the system still work...
The proliferation of distributed internet services has reaffirmed the need for reliable and high-per...
Distributed computing environments are increasingly deployed over geographically spanning data cente...
Modern architectures provide access to many hardware performance events, which are capable of provid...
Cloud datacenters comprise hundreds or thousands of disparate application services, each having stri...
Large-scale data center networks are complex, comprising several thousand network devices and sever...
Detecting and localizing performance faults is crucial for operating large enterprise data...
Fingerprinting summarizes the history of internal processor state updates into a cryptographic signa...
Diagnosing performance problems in modern datacenters and distributed systems is challenging, as the...
Distributed software systems have become the backbone of Internet services. Failures in production ...
The pervasive digital innovation of the last decades has led to a remarkable transformation of maint...