© 2014 IEEE. As supercomputers and data centers grow towards exascale, failures become the norm. System logs play a critical role in the increasingly complex tasks of automatic failure prediction and diagnosis. Many failure prediction methods analyze event logs of large-scale systems, but there is still no widely used method that predicts failures from both non-fatal and fatal events, and no precise one that uses fine-grained information (such as failure type, node location, related application, and time of occurrence). A deeper and more precise log analysis technique is needed. We propose a three-step approach to draw out event dependencies and to identify failure-event generating processes. First, we clu...
This paper introduces a failure analysis procedure that underpins real-time fault prognosis. In the ...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
The level of trust in log-based dependability characterization of complex distributed systems is bi...
Research in the field of failure log analysis shows that spatial and temporal patterns exist among e...
The ability to automatically detect faults or fault patterns to enhance system reliability is import...
Failure analysis is valuable to dependability engineers because it supports designing effective miti...
Log preprocessing, a process applied to the raw log before applying a predictive method, is of para...
System logs are the first source of information available to system designers to analyze and trouble...
System logs are an important tool in studying the conditions (e.g., environment misconfig...
Software faults are recognized to be among the main causes of system failures in many applicat...
With the increasing scale and complexity of high-performance computing (HPC) systems, reliability ma...
Failures of cluster systems have proven to have adverse effects and can be costly. System administr...
Diagnosing and correcting failures in complex, distributed systems is difficult. In a network of per...
Automated fault prediction and diagnosis in HPC systems needs to be efficient for better system resi...