The increasing complexity of modern high-performance computing (HPC) systems necessitates automated, data-driven methodologies that support system administrators' efforts to increase system availability. Anomaly detection is integral to improving availability: it eases the system administrator's burden and reduces the time between an anomaly and its resolution. However, current state-of-the-art (SOTA) approaches to anomaly detection are supervised and semi-supervised, so they require a human-labelled dataset with anomalies, which is often impractical to collect in production HPC systems. Unsupervised anomaly detection approaches based on clustering, aimed at alleviating the need for accurate an...
A critical problem in managing large-scale clusters is to identify the location of...
The only way for the world to move into the bright future is to move from nonrenewable resources in...
This demo paper presents a design and implementation of a system AnomalyKiTS for detecting anomalies...
Automated and data-driven methodologies are being introduced to assist system administrators in mana...
High Performance Computing (HPC) systems are complex machines with heterogeneous components that can...
In response to the demand for higher computational power, the number of computing nodes in high perf...
This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bolo...
The IHEP local cluster is a middle-sized HEP data center which consists of 20'000 CPU slots, hundred...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...
In a Real Time Clearing System (RTCS) there are several thousands of transactions per second, and ev...
Monitoring resources in a server environment is an essential and indispensable process that ensur...
Anomaly detection algorithms solve the problem of identifying unexpected values in data sets. Such a...
Large-scale computing systems provide great potential for scientific exploration. However, the compl...
A methodology as well as a suggested solution to the problem of unsupervised anomaly detection for c...