Abstract: A critical problem in managing large-scale clusters is locating the source of a fault when unusual events occur. As high performance computing (HPC) systems grow in scale, they contain ever more components. When a system fails to function properly, health-related data are collected for troubleshooting. However, because of the massive quantity of information obtained from a large number of components, the root causes of anomalies are often buried like needles in a haystack. In this paper, we present a localization method that automatically identifies the potential root causes of a problem (i.e., a subset of nodes) from the overwhelming amount of data collected system-wide. System managers can focus on examining these pot...
Many social and economic systems can be represented as attributed networks encoding the relations be...
A clustering model identification method based on the statistics has been proposed to improve the ab...
In response to the demand for higher computational power, the number of computing nodes in high perf...
We describe a new fault localization technique for software bugs in large-scale computing systems. O...
Large microservice clusters deployed in the cloud can be very difficult to both monitor and debu...
Operation and maintenance of large distributed cloud applications can quickly become unmanageably co...
We consider the problem of network anomaly detection in large distributed systems. In this setting, ...
Determining anomalies in data streams that are collected and transformed from various types of netwo...
In this paper, we describe disparity, a tool that does parallel, scalable anomaly detection for clus...
Distributed applications running inside the cloud are prone to performance anomalies due to various reas...
The increasing complexity of modern high-performance computing (HPC) systems necessitates the introd...
Principal component analysis combined with the residual error is an effective anomaly detection technique. In ...
Large-scale computing systems provide great potential for scientific exploration. However, the compl...
The concept of statistical anomalies, or outliers, has fascinated experimentalists since the earlies...