System failures are expected to be frequent in the exascale era, just as they are in current petascale systems. The health of such systems is usually determined through challenging analysis of large amounts of unstructured and redundant log data. In this paper, we leverage log data and propose Clairvoyant, a novel self-supervised (i.e., no labels needed) model that predicts node failures in HPC systems, based on a recent deep learning approach called the transformer-decoder and its self-attention mechanism. Clairvoyant predicts node failures by (i) predicting a sequence of log events and then (ii) identifying whether a failure is part of that sequence. We carefully evaluate Clairvoyant and another state-of-the-art failure prediction approach, Desh, based on two real-w...
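The two-step idea in the abstract above (predict a sequence of upcoming log events, then check whether a failure event appears in it) can be illustrated with a minimal sketch. This is not the Clairvoyant model: a simple bigram frequency counter stands in for the transformer-decoder, and the event names (`boot`, `ecc_warn`, `mce`, `node_down`) and all function names are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_bigram(history):
    """Count, for each event, which events followed it in the training log.

    Stand-in for the learned sequence model; no labels are needed,
    only the raw ordered event stream.
    """
    follows = defaultdict(Counter)
    for prev, nxt in zip(history, history[1:]):
        follows[prev][nxt] += 1
    return follows


def predict_sequence(follows, last_event, horizon):
    """Step (i): greedily roll the model forward up to `horizon` events."""
    predicted = []
    current = last_event
    for _ in range(horizon):
        candidates = follows.get(current)
        if not candidates:
            break  # no continuation was seen for this event during training
        current = candidates.most_common(1)[0][0]
        predicted.append(current)
    return predicted


def predicts_failure(predicted, failure_events):
    """Step (ii): flag a failure if any predicted event is a failure event."""
    return any(event in failure_events for event in predicted)


# Hypothetical event stream from a single node (assumed data).
history = ["boot", "ecc_warn", "ecc_warn", "mce", "node_down",
           "boot", "ecc_warn", "mce", "node_down"]
model = fit_bigram(history)
upcoming = predict_sequence(model, last_event="ecc_warn", horizon=3)
alarm = predicts_failure(upcoming, failure_events={"node_down"})
```

In this toy run, the most frequent continuation after `ecc_warn` leads to `node_down` within the three-event horizon, so `alarm` is true. A real system would replace `fit_bigram` and `predict_sequence` with a trained sequence model over parsed log templates.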
The use of aircraft operation logs to develop a data-driven model to predict probable failures that ...
Research in the field of failure log analysis shows that spatial and temporal patterns exist among e...
National Technical University of Athens -- Postgraduate Thesis. Interdisciplinary-Interdepartmental Program of Postgraduate...
With the advent of resource-hungry applications such as scientific simulations and artificial intell...
With the increasing complexity and scope of software systems, their dependability is crucial. The a...
With the increasing scale and complexity of high performance computing (HPC) systems, reliability ma...
Following the growth of high performance computing (HPC) systems in size and complexity, and the adv...
Failure is an increasingly important issue in high performance computing and cloud systems. As l...
Failure is an increasingly important issue in high performance computing and cloud systems. As large...
An increasingly large percentage of computing capacity in today's large high-performance computing s...
Automated fault prediction and diagnosis in HPC systems need to be efficient for better system resi...
System logs are the first source of information available to system designers to analyze and trouble...
Failure is an increasingly important issue in high performance computing and cloud systems. As la...
System logs are an important tool in studying the conditions (e.g., environment misconfig...
ACKNOWLEDGMENT The authors would like to thank the anonymous reviewers for their constructive feedba...