The demands of increasingly large scientific application workflows lead to the need for more powerful supercomputers. As the scale of supercomputing systems has grown, the prediction of fault tolerance has become an increasingly critical area of study, since the prediction of system failures can improve performance by saving checkpoints in advance. We propose a real-time failure detection algorithm that adopts an event-based prediction model. The prediction model is a convolutional neural network that utilizes both traditional event attributes and additional spatio-temporal features. We present a case study using our proposed method with six years of reliability, availability, and serviceability event logs recorded by Mira, a Blue Gene/Q s...
In modern data centers, storage system failures are major contributors to downtimes and maintenance ...
Large-scale computing systems provide great potential for scientific exploration. However, the compl...
Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomp...
A large percentage of computing capacity in today's large high-performance computing systems is waste...
Automated fault prediction and diagnosis in HPC systems needs to be efficient for better system resi...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...
To facilitate proactive fault management in large-scale systems such as IBM Blue Gene/P, on...
Failure is an increasingly important issue in high performance computing and cloud systems. As la...
If machine failures can be detected preemptively, then maintenance and repairs can be performed more...
Most fault detection systems (FDS) have proved their efficiency in the detection of anomalie...
Analyzing, understanding and predicting failure is of paramount importance to achieve effective faul...
The growing computational and storage needs of scientific applications mandate the deployment of ext...
This paper introduces a failure analysis procedure that underpins real-time fault prognosis. In the ...
We focus on machine failure prediction in Industry 4.0. Indeed, it is used for classification problem...
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become ...