Failure diagnosis for large compute clusters using only message logs is known to be incomplete. The recent availability of resource use data provides another potentially useful source of data for failure detection and diagnosis. Early work combining message logs and resource use data for failure diagnosis has shown promising results. This paper describes the CRUMEL framework, which implements a new approach to combining rationalized message logs and resource use data for failure diagnosis. CRUMEL identifies patterns of errors and resource use and correlates these patterns by time with system failures. Application of CRUMEL to data from the Ranger supercomputer has yielded improved diagnoses over previous research. CRUMEL has: (i) shown that mor...
In this paper, we propose an algorithm to efficiently diagnose large-scale clustered failu...
Resource-intensive applications such as scientific applications require the architecture or system o...
Distributed software systems have become the backbone of Internet services. Failures in production ...
Failures of cluster systems have proven to have adverse effects and can be costly. System administr...
The ability to automatically detect faults or fault patterns to enhance system reliability is import...
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become ...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. I...
System logs are the first source of information available to system designers to analyze and troublesh...
With the increasing scale and complexity of high performance computing (HPC) systems, reliability ma...
Automated fault prediction and diagnosis in HPC systems needs to be efficient for better system resi...
Large-scale clusters are growing at a rapid pace, and the resulting amount of monitoring data produc...
The era of petascale computing brought machines with hundreds of thousands of processors. The next g...