Abstract — System- and application-level failures can be characterized by mining relevant log files and performing statistical analysis on the information they provide. The resulting data can then inform a range of future developments and studies on the corresponding computational architecture, including failure prediction, fault tolerance, performance modelling, and power awareness. This paper provides a statistical analysis of the application- and system-level failures encountered and logged by the IBM Blue Gene/L supercomputing system over a six-month period.
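To make the analysis workflow concrete, the following is a minimal sketch of the kind of log mining and statistical summarization the abstract describes. It assumes a simplified, hypothetical event format (a CSV with timestamp, severity, component, and message columns) rather than the raw Blue Gene/L RAS log schema, and it computes basic failure counts and an estimated mean time between failures (MTBF); it is an illustration of the general technique, not the paper's actual method.

```python
import csv
from collections import Counter
from datetime import datetime

def load_events(path):
    """Read failure events from a CSV log with columns:
    timestamp (ISO 8601), severity, component, message.
    (Hypothetical simplified format, not the raw RAS log schema.)"""
    events = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            row["timestamp"] = datetime.fromisoformat(row["timestamp"])
            events.append(row)
    return sorted(events, key=lambda e: e["timestamp"])

def failure_statistics(events):
    """Count failures per component and per severity, and estimate
    the mean time between failures (MTBF) in hours."""
    by_component = Counter(e["component"] for e in events)
    by_severity = Counter(e["severity"] for e in events)
    gaps = [
        (b["timestamp"] - a["timestamp"]).total_seconds() / 3600.0
        for a, b in zip(events, events[1:])
    ]
    mtbf_hours = sum(gaps) / len(gaps) if gaps else float("nan")
    return by_component, by_severity, mtbf_hours

if __name__ == "__main__":
    events = load_events("bgl_failures.csv")  # hypothetical filename
    comp, sev, mtbf = failure_statistics(events)
    print("Failures by component:", comp.most_common(5))
    print("Failures by severity:", sev.most_common())
    print(f"Estimated MTBF: {mtbf:.1f} hours")
```

In practice, such per-component counts and inter-failure time distributions are the starting point for the failure prediction and fault-tolerance studies mentioned above.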