As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for HPC systems based on machine learning that has been designed specifically to operate with live streamed data. We cast the problem and its solution within realistic operating constraints of online use. Our results show that almost perfect classification accuracy can be reached for different fault types with low computational overhead and minimal de...
The need for computer systems to be reliable has increasingly become important as the dependence on ...
Supercomputers have played an essential role in the progress of science and engineering research. As...
Failure is an increasingly important issue in high performance computing and cloud systems. As large...
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that t...
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at ...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...
The Antarex dataset contains trace data collected from the homonymous experimental HPC system locate...
A increasingly larger percentage of computing capacity in today's large high-performance computing s...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Enginee...
Large supercomputers are composed of numerous components that risk to break down or behave in unwant...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
International audienceBuilding an infrastructure for Exascale applications requires, in addition to ...
High-performance Computing (HPC) systems play pivotal roles in societal and scientific advancements,...
In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly becoming larger...
Resiliency of exascale systems has quickly become an important concern for the scientific community....
The need for computer systems to be reliable has increasingly become important as the dependence on ...
Supercomputers have played an essential role in the progress of science and engineering research. As...
Failure is an increasingly important issue in high performance computing and cloud systems. As large...
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that t...
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at ...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...
The Antarex dataset contains trace data collected from the homonymous experimental HPC system locate...
A increasingly larger percentage of computing capacity in today's large high-performance computing s...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Enginee...
Large supercomputers are composed of numerous components that risk to break down or behave in unwant...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
International audienceBuilding an infrastructure for Exascale applications requires, in addition to ...
High-performance Computing (HPC) systems play pivotal roles in societal and scientific advancements,...
In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly becoming larger...
Resiliency of exascale systems has quickly become an important concern for the scientific community....
The need for computer systems to be reliable has increasingly become important as the dependence on ...
Supercomputers have played an essential role in the progress of science and engineering research. As...
Failure is an increasingly important issue in high performance computing and cloud systems. As large...