Large supercomputers are composed of numerous components that are prone to breaking down or behaving in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for system resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that takes advantage of holistic data centre monitoring, system administrator node status labeling, and an explainable model for fault detection in supercomputing nodes. The proposed model aims at classifying the different states of the computing nodes using the labeled data describing the supercomputer behaviour, data which is typically collected by system administrators but not ...
The need for computer systems to be reliable has become increasingly important as the dependence on ...
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at ...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly becoming larger...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...
Anomaly detection systems are vital in ensuring the availability of modern High-Performance Computin...
Automated fault prediction and diagnosis in HPC systems needs to be efficient for better system resi...
An increasingly large percentage of computing capacity in today's large high-performance computing s...
In response to the demand for higher computational power, the number of computing nodes in high perf...
The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. I...
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that t...
Enterprise and high-performance computing systems are growing extremely large and complex, employing...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...