With the growth of system size and complexity, reliability has become a major concern for large-scale systems. When a failure occurs, system administrators typically trace the events in Reliability, Availability, and Serviceability (RAS) logs for root cause diagnosis. However, RAS logs contain only limited diagnostic information. Moreover, manual processing is time-consuming, error-prone, and not scalable. To address this problem, we present an automated root cause diagnosis mechanism for large-scale HPC systems. Our mechanism examines multiple logs to provide a 3-D fine-grained root cause analysis. Here, 3-D means that our analysis pinpoints the failure layer, the time, and the location of the event that caus...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
The growing computational and storage needs of scientific applications mandate the deployment of ext...
With the increasing scale and complexity of high performance computing (HPC) systems, reliability ma...
System- and application-level failures can be characterized by mining relevant log files ...
Large-scale computing systems provide great potential for scientific exploration. However, the compl...
With the growth of system size and complexity, reliability has become of paramount importan...
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become ...
Distributed software systems have become the backbone of Internet services. Failures in production ...
A large percentage of computing capacity in today's large high-performance computing systems is waste...
Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose fai...
Automated fault prediction and diagnosis in HPC systems need to be efficient for better system resi...
The need for computer systems to be reliable has become increasingly important as the dependence on ...
The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. I...