Reliability is a cumbersome problem in the evolution of High Performance Computing (HPC) systems and data centers. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and end users have to discover them manually. Clearly, this approach does not scale to large supercomputers and facilities: automated methods to detect faults and unhealthy conditions are needed. Our method uses a type of neural network called an autoencoder, trained to learn the normal behavior of a real, in-production HPC system and deployed on the edge of each computing node. We obtain very good accuracy (values ranging between 90% and 95...
Networked computer systems continue to grow in scale and in the complexity of their components and i...
Automated and data-driven methodologies are being introduced to assist system administrators in mana...
The Antarex dataset contains trace data collected from the homonymous experimental HPC system locate...
High Performance Computing (HPC) systems are complex machines with heterogeneous components that can...
Anomaly detection in supercomputers is a very difficult problem due to the large scale of the systems ...
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that t...
In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly becoming larger...
Large supercomputers are composed of numerous components that risk breaking down or behaving in unwant...
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at ...
In response to the demand for higher computational power, the number of computing nodes in high perf...
This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bolo...
Anomaly detection is the identification of events or observations that deviate from the expected beh...
Anomaly detection systems are vital in ensuring the availability of modern High-Performance Computin...
The increasing complexity of modern high-performance computing (HPC) systems necessitates the introd...