The growth in size and complexity of HPC systems leads to a rapid increase in their failure rates. In the near future, the mean time between failures of HPC systems is expected to become too short for current failure recovery mechanisms to recover the systems from failures. Early failure detection is thus essential to prevent the destructive effects of failures. Based on measurements of a production system at TU Dresden over an 8-month period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that in many cases the observed node failures are directly correlated. The significance of such a study is achieving a clearer understanding of correlations betw...
Resource failures and down times have become a growing concern for large-scale computational platfor...
System logs are the first source of information available to system designers to analyze and trouble...
Failure diagnosis for large compute clusters using only message logs is known to be incomplete. Rece...
In this paper we study the correlation of node failures in time and space. Our study is based on mea...
Automated fault prediction and diagnosis in HPC systems needs to be efficient for better system resi...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
Following the growth of high-performance computing (HPC) systems in size and complexity, and the adv...
System logs are the first source of information available to system designers to analyze and trouble...
In response to the demand for higher computational power, the number of computing nodes in high perf...
An increasingly larger percentage of computing capacity in today's large high-performance computing s...
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become ...
Large-scale computing systems provide great potential for scientific exploration. However, the comp...
Designing highly dependable systems requires a good understanding of failure characteristics. Unfort...
Large supercomputers are composed of numerous components that risk breaking down or behaving in unwant...