System logs are the first source of information available to system designers for analyzing and troubleshooting their cluster systems. For example, High-Performance Computing (HPC) systems generate a large volume of heterogeneous data from multiple sub-systems, so the idea of using a single source of data to achieve a given goal, such as identification of failures, is losing its validity. System log-analysis tools help system designers gain insight into a large volume of system logs. They enable system designers to perform various analyses (e.g., diagnosing or predicting node failures). Current system log-analysis tools vary significantly in their function and design. We conduct a systematic review of literature on system lo...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
System- and application-level failures can be characterized by mining relevant log files ...
The article analyzes the paths and algorithms for automating the monitoring of computer system state...
With the increasing scale and complexity of high performance computing (HPC) systems, reliability ma...
This paper presents a methodology and a system, named LogMaster, for mining correlations of...
The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. I...
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become ...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
Failure diagnosis for large compute clusters using only message logs is known to be incomplete. Rece...
The level of trust in log-based dependability characterization of complex distributed systems is bi...
The ability to automatically detect faults or fault patterns to enhance system reliability is import...
System failures are expected to be frequent in the exascale era, as they are in current petascale systems. T...
Part 4: Applications of Parallel and Distributed Computing. In modern computer s...
Failures of cluster systems have proven to have adverse effects and can be costly. System administr...