The Antarex dataset contains trace data collected from the homonymous experimental HPC system located at ETH Zurich while it was subjected to fault injection, for the purpose of conducting machine learning-based fault detection studies for HPC systems. We had to acquire our own dataset because commercial HPC system operators are very reluctant to share trace data containing information about faults in their systems. To acquire the data, we executed benchmark applications and simultaneously injected faults into the system at specific times via dedicated programs, so as to trigger anomalies in the behaviour of the applications. Our dataset covers a wide range of faults, from hardware faults to misconfiguratio...
Large supercomputers are composed of numerous components that risk breaking down or behaving in unwant...
Large-scale computing systems provide great potential for scientific exploration. However, the comp...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at ...
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that t...
Supercomputers have played an essential role in the progress of science and engineering research. As...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
HPC-ODA is a collection of datasets acquired on production HPC systems, which are representative of ...
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, wit...
This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bolo...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...