The Antarex dataset contains trace data collected from the homonymous experimental HPC system located at ETH Zurich while it was subjected to fault injection, for the purpose of conducting machine learning-based fault detection studies for HPC systems. Acquiring our own dataset was made necessary by the fact that commercial HPC system operators are very reluctant to share trace data containing information about faults in their systems. In order to acquire data, we executed benchmark applications and at the same time injected faults in the system at specific times via dedicated programs, so as to trigger anomalies in the behaviour of the applications. A wide range of faults is covered in our dataset, from hardware faults, to misconfiguratio...
As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failur...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
A. Netti has been supported by the EU project Oprecomp-Open Transprecision Computing (grant a...
Supercomputers have played an essential role in the progress of science and engineering research. As...
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that t...
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, wit...
HPC-ODA is a collection of datasets acquired on production HPC systems, which are representative of ...
This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bolo...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
Resilience/fault-tolerance has become a key challenge for large-scale parallel...
Large-scale computing systems provide great potential for scientific exploration. However, the compl...