system performance diagnosis, machine learning, transfer learning, scalability

Distributed systems continue to grow in scale and complexity, resulting in increasingly involved interactions among components and increasingly intricate failure modes that are very hard to diagnose manually. This increased vulnerability of larger systems, together with the increased difficulty of failure diagnosis, has motivated machine learning approaches that automate the diagnosis task. Although preliminary results are encouraging, scaling the existing approaches up to large applications remains challenging. As scale increases, current approaches suffer from the curse of dimensionality, exacerbated by the exploding set of system states and measure...
In this paper, we propose an algorithm to efficiently diagnose large-scale clustered failu...
Diagnosing performance problems in modern datacenters and distributed systems is challenging, as the...
For constructing fault tolerance mechanisms in large massively parallel multiprocessor systems, a s...
156 p. Thesis (Ph.D.), University of Illinois at Urbana-Champaign, 2007. The self-diagnosing capabilit...
Large production systems are susceptible to chronic performance problems where the system still work...
In this paper, we address the problem of efficient diagnosis in real-time systems capable of on-line...
Distributed systems form an integral part of human life—from ATMs to the Domain Name Service. Typica...
It is important to keep an information system working properly with efficient performance i...
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed an...
As today's distributed applications increase in complexity, it becomes increasingly difficult to ...
Providing contractual performance assurances in distributed systems is an important and challenging ...
required to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challeng...
Distributed software systems have become the backbone of Internet services. Failures in production ...
To ensure high availability, self-managing systems require self-monitoring and a system model agains...
In today's world where distributed systems form many of our critical infrastructures, dependabili...