In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly becoming larger and more complex, and so are the issues concerning their maintenance. Luckily, many current HPC systems are endowed with data monitoring infrastructures that characterize the system state, and whose data can be used to train Deep Learning (DL) anomaly detection models, a very popular research area. However, the lack of labels describing the state of the system is a widespread issue, as annotating data is a costly task that generally falls on human system administrators and thus does not scale toward Exascale. In this article we investigate the possibility of extracting labels from a service monitoring tool (Nagios) currently used by HPC syste...
The main goal of this research is to contribute to automated performance anomaly detection for large...
Modern scientific discoveries are driven by an unsatisfiable demand for computational resources. Hig...
Anomaly detection is the identification of events or observations that deviate from the expected beh...
Anomaly detection systems are vital in ensuring the availability of modern High-Performance Computin...
Large supercomputers are composed of numerous components that risk breaking down or behaving in unwant...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...
In response to the demand for higher computational power, the number of computing nodes in high perf...
Modern scientific discoveries are driven by an unsatisfiable demand for computational resources. To ...
Anomaly detection in supercomputers is a very difficult problem due to the large scale of the systems ...
High Performance Computing (HPC) systems are complex machines with heterogeneous components that can...
High-performance Computing (HPC) systems play pivotal roles in societal and scientific advancements,...
Proceeding of: IEEE 5th International Conference on Big Data Security on Cloud (BigDataSecurity), 27...
An increasingly large percentage of computing capacity in today's large high-performance computing s...
Following the growth of high performance computing (HPC) systems in size and complexity, and the adv...