A significant challenge in developing automated problem-diagnosis tools for distributed systems is enabling these tools to distinguish changes in system behavior caused by workload changes from those caused by faults. To address this challenge, current techniques, which are typically white-box, extract semantically rich knowledge about the target application through fairly invasive, high-overhead instrumentation. We propose and explore two scalable, low-overhead, non-invasive techniques that infer semantics about target distributed systems in a black-box manner to facilitate problem diagnosis. RAMS applies statistical analysis to hardware performance counters to predict whether a given node in a distributed system is faulty, while BlackShee...
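As an illustration of what statistical analysis of per-node counters might look like, the following is a minimal sketch and not the RAMS algorithm, whose details the abstract does not give. It assumes that peer nodes see similar workloads, so a workload change shifts every node's reading together while a fault makes one node stand out. The function name `flag_outlier_nodes`, the sample readings, and the 3.5 cutoff on the modified z-score are all assumptions made for this example.

```python
from statistics import median


def flag_outlier_nodes(counters, threshold=3.5):
    """Return the nodes whose counter reading deviates strongly from their peers.

    counters:  dict mapping node name -> one reading of a hardware counter
               (hypothetical input format, e.g. cycles per instruction)
    threshold: cutoff on the modified z-score; 3.5 is a common rule of thumb,
               not a value taken from the source
    """
    values = list(counters.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that one bad node cannot inflate.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Most peers agree exactly, so any node that differs is suspicious.
        return [node for node, v in counters.items() if v != med]
    return [
        node
        for node, v in counters.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical readings: node5 departs from its peers.
readings = {"node1": 1.02, "node2": 0.98, "node3": 1.00, "node4": 1.01, "node5": 2.40}
suspects = flag_outlier_nodes(readings)

# A uniform shift, as a workload change might cause, flags nothing.
shifted = {node: v + 0.5 for node, v in readings.items() if node != "node5"}
no_suspects = flag_outlier_nodes(shifted)
```

The median and median absolute deviation are used here instead of the mean and standard deviation because a single faulty node would otherwise distort the baseline it is compared against.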
As today's distributed applications increase in complexity, it becomes increasingly difficult to ...
Bugs in distributed systems are often hard to find. Many bugs reflect discrepancies between a system...
This paper introduces a novel approach to failure prediction for mission critical distributed system...
This research was made possible by the guidance of Priya Narasimhan. A significant challenge in devel...
Many interesting large-scale systems are distributed systems of multiple communicating components. S...
Large production systems are susceptible to chronic performance problems where the system still work...
Diagnosing performance problems in modern datacenters and distributed systems is challenging, as the...
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed an...
This paper discusses a methodology for diagnosing performance problems for parallel and distributed ...
Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose fai...
In order to prevent violation of service-level objectives and to guarantee good user experience, det...
Distributed software systems have become the backbone of Internet services. Failures in production ...
To diagnose performance problems in production systems, many OS kernel-level monitoring and...
Applications may have unintended performance problems in spite of compiler optimizations, because of...
Modern distributed systems are characterized by a growing complexity of their architecture, function...