Large-scale computing systems provide great potential for scientific exploration. However, the complexity that accompanies these enormous machines raises challenges for both users and operators. The effective use of such systems is often hampered by failures encountered when running applications on systems containing tens of thousands of nodes and hundreds of thousands of compute cores capable of yielding petaflops of performance. In systems of this size, failure detection is complicated and root-cause diagnosis is difficult. This paper describes our recent work in the identification of anomalies in monitoring data and system logs to provide further insights into machine status, runtime behavior, failure modes, and failure root causes. It discu...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
Anomaly detection is the process of discovering some anomalous behaviour in the real-time operation ...
In response to the demand for higher computational power, the number of computing nodes in high perf...
System- and application-level failures can be characterized by mining relevant log files ...
Following the growth of high-performance computing (HPC) systems in size and complexity, and the adv...
Embedded systems suffer from reliability issues such as variations in temperature and voltage, singl...
With the increasing scale and complexity of high-performance computing (HPC) systems, reliability ma...
With the growth of system size and complexity, reliability has become a major concern for large-scal...
A large percentage of computing capacity in today's large high-performance computing systems is waste...
Automated and data-driven methodologies are being introduced to assist system administrators in mana...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. I...
As the volume of data recorded from systems increases, there is a need to effectively analyse this d...