Abstract. In this paper, we propose an algorithm to efficiently diagnose large-scale clustered failures. The algorithm, Cluster-MAX-COVERAGE (CMC), is based on a greedy approach. We address the challenge of determining faults with incomplete symptoms. CMC makes novel use of both positive and negative symptoms to quickly output a hypothesis list with few false negatives and false positives. CMC requires reports from about half as many nodes as other existing algorithms to determine failures with 100% accuracy. Moreover, CMC achieves this gain significantly faster (sometimes by two orders of magnitude) than an algorithm that matches its accuracy. Furthermore, we propose an adaptive algorithm called Adaptive-MAX-COVERAGE (AMC) ...
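The abstract above names a greedy, coverage-based diagnosis that uses both positive symptoms (nodes reported as failed) and negative symptoms (nodes reported as healthy). The following is a minimal sketch of that general idea only, not the CMC algorithm as published. The function name `greedy_cover_diagnosis`, the `candidates` mapping from a fault to the nodes it would affect, and the rule that a single healthy report rules out a candidate are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def greedy_cover_diagnosis(
    candidates: Dict[str, Set[str]],
    failed: Set[str],
    healthy: Set[str],
) -> List[str]:
    """Return a hypothesis list of candidate faults (illustrative sketch).

    candidates: hypothetical map from a fault name to the nodes it would take down.
    failed:     nodes that reported a failure (positive symptoms).
    healthy:    nodes that reported being up (negative symptoms).
    """
    # Assumed rule: a candidate that would affect a node reported healthy
    # is contradicted by a negative symptom and is dropped up front.
    viable = {c: nodes for c, nodes in candidates.items() if not (nodes & healthy)}

    unexplained = set(failed)
    hypothesis: List[str] = []

    while unexplained and viable:
        # Greedy step: take the candidate covering the most unexplained failures.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(viable), key=lambda c: len(viable[c] & unexplained))
        gain = viable[best] & unexplained
        if not gain:
            break  # no remaining candidate explains any further failure
        hypothesis.append(best)
        unexplained -= gain
        del viable[best]

    return hypothesis


if __name__ == "__main__":
    faults = {
        "rack1": {"a", "b", "c"},
        "rack2": {"c", "d"},
        "switch": {"a", "b", "c", "d", "e"},
    }
    print(greedy_cover_diagnosis(faults, failed={"a", "b", "d"}, healthy={"e"}))
```

In this sketch the healthy report from node `e` eliminates the `switch` candidate before the greedy loop runs, which is one simple way negative symptoms can reduce false positives. A real system would likely need to tolerate missing or unreliable reports rather than exclude candidates outright.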
As today's distributed applications increase in complexity, it becomes increasingly difficult to ...
We consider the problem of adaptive fault diagnosis in hypercube multiprocessor systems. Processors ...
Keywords: system performance diagnosis, machine learning, transfer learning, scalability. Distributed systems c...
This paper presents a scalable, adaptive and time-bounded general approach to assure reliable, real-...
Future extreme-scale high-performance computing systems will be required to work under frequent com...
Distributed systems and extreme-scale systems have become ubiquitous in recent years and have seen throughou...
To improve the overall dependability of large-scale cluster systems, an online fault detection mechani...
In this paper we propose a scalable failure detection service for large scale ad hoc networks using ...
It is the age of information technology. Around the world, millions of computers are being linked t...
Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose fai...
Failure detection is a basic service for building dependable systems. The large-scale distribution o...
Large-scale clusters are growing at a rapid pace, and the resulting amount of monitoring data produc...
Abstract. Past research on probing-based network monitoring provides solutions based on preplanned p...
In this paper, we address the problem of efficient diagnosis in real-time systems capable of on-line...
required to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challeng...