With the advent of resource-hungry applications such as scientific simulations and artificial intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming more pressing. HPC systems are typically characterised by their scale: they contain a large number of sophisticated, tightly integrated hardware components. This scale and design complexity inherently introduce uncertainty, i.e., dependability threats that perturb the system during application execution. While running, these systems generate a massive number of log messages that capture the health status of the various components. Several previous works have leveraged those systems’ logs fo...
Today’s large-scale systems such as High Performance Computing (HPC) Systems are designed/utilized t...
Large-scale clusters are growing at a rapid pace, and the resulting amount of monitoring data produc...
With the enormous number of computing resources in HPC and Cloud systems, failures become a major co...
System failures are expected to be frequent in the exascale era, as they are in current petascale systems. T...
An increasingly large percentage of computing capacity in today's large high-performance computing s...
The need for computer systems to be reliable has become increasingly important as the dependence on ...
Following the growth of high performance computing (HPC) systems in size and complexity, and the adv...
With the increasing scale and complexity of high performance computing (HPC) systems, reliability ma...
Failure is an increasingly important issue in high performance computing and cloud systems. As large...
As society becomes more dependent upon computer systems to perform increasingly critical tasks, ensu...
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at ...
Supercomputers have played an essential role in the progress of science and engineering research. As...
The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. I...