Large-scale computing systems provide great potential for scientific exploration. However, the complexity that accompanies these enormous machines raises challenges for both users and operators. The effective use of such systems is often hampered by failures encountered when running applications on systems containing tens of thousands of nodes and hundreds of thousands of compute cores capable of yielding petaflops of performance. In systems of this size, failure detection is complicated and root-cause diagnosis is difficult. This paper describes our recent work in the identification of anomalies in monitoring data and system logs to provide further insights into machine status, runtime behavior, failure modes, and failure root causes. It ...
The Antarex dataset contains trace data collected from the homonymous experimental HPC system locate...
An increasingly large percentage of computing capacity in today's large high-performance computing s...
The demands of increasingly large scientific application workflows lead to the need for more powerfu...
In response to the demand for higher computational power, the number of computing nodes in high perf...
System- and application-level failures can be characterized by mining relevant log files ...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...
A large percentage of computing capacity in today's large high-performance computing systems is waste...
With the growth of system size and complexity, reliability has become a major concern for large-scal...
With the increasing scale and complexity of high performance computing (HPC) systems, reliability ma...
Embedded systems suffer from reliability issues such as variations in temperature and voltage, singl...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. I...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, eff...