As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates at both the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur, and initiating corrective actions before they turn into failures, become essential for continued operation. Central to this objective is fault injection, the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach is that it can operate on streamed data in an online manner, thus opening the possibility t...
HPC systems are widely used in industrial, economic, and scientific applications, and many of thes...
2018 Summer. Includes bibliographical references. High performance computing (HPC) systems, such as da...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at ...
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that t...
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, wit...
The Antarex dataset contains trace data collected from the homonymous experimental HPC system locate...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...
Supercomputers have played an essential role in the progress of science and engineering research. As...
An increasingly large percentage of computing capacity in today's large high-performance computing s...
Abstract. Large-scale computing platforms provide tremendous capabilities for scientific discovery. ...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Enginee...
With the massive adoption of machine learning (ML) applications in HPC domains, the reliability of M...
Resilience/fault-tolerance has become a key challenge for large-scale parallel...