In response to the demand for higher computational power, the number of computing nodes in high performance computing (HPC) systems is increasing rapidly. Exascale HPC systems are expected to arrive by 2020. With the drastic increase in the number of HPC system components, a sudden increase in the number of failures is expected, which in turn threatens the continuous operation of HPC systems. Detecting failures as early as possible and, ideally, predicting them is a necessary step toward avoiding interruptions in HPC system operation. Anomaly detection is a well-known general-purpose approach for failure detection in computing systems. The majority of existing methods are designed for specific architectures, require adjustments ...
A critical problem in managing large-scale clusters is to identify the location of...
The impact of an anomaly is domain-dependent. In a dataset of network activities, an anomaly can imp...
System- and application-level failures can be characterized by mining relevant log files ...
Large-scale computing systems provide great potential for scientific exploration. However, the compl...
Large-scale computing systems provide great potential for scientific exploration. However, the comp...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...
In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly becoming larger...
Automated and data-driven methodologies are being introduced to assist system administrators in mana...
High Performance Computing (HPC) systems are complex machines with heterogeneous components that can...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...
Embedded systems suffer from reliability issues such as variations in temperature and voltage, singl...
Modern scientific discoveries are driven by an insatiable demand for computational resources. To ...
The increasing complexity of modern high-performance computing (HPC) systems necessitates the introd...
The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. I...
An increasingly large percentage of computing capacity in today's large high-performance computing s...