As high-performance computing (HPC) systems grow in scale and complexity, reliability management is becoming a major concern. System logs are the primary source of information for understanding and analyzing system problems. However, manual log processing is time-consuming, error-prone, and does not scale. To date, few studies have addressed automated log analysis for practical use in HPC systems. In this thesis, we present a log analysis infrastructure that exploits data mining and machine learning technologies. Our work is divided into four parts: log pre-processing, online failure prediction, automatic root cause diagnosis, and reliability modeling. We evaluate our results by means of system logs collected from p...
The size and complexity of cloud computing systems make runtime errors inevitable. These errors cou...
Large-scale computing systems provide great potential for scientific exploration. However, the comp...
Many problems exist in the testing of a large scale system. The automated testing results are not re...
System- and application-level failures can be characterized by mining relevant log files ...
System logs are the first source of information available to system designers to analyze and trouble...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...
With the growth of system size and complexity, reliability has become a major concern for large-scal...
Part 4: Applications of Parallel and Distributed Computing. In modern computer s...
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become ...
Large-scale clusters are growing at a rapid pace, and the resulting amount of monitoring data produc...
Manually diagnosing recurrent faults in software systems can be an inefficient use of time for engin...
Failure is an increasingly important issue in high performance computing and cloud systems. As la...
Large-scale computing systems provide great potential for scientific exploration. However, the compl...
Failure analysis is valuable to dependability engineers because it supports designing effective miti...