A. Netti has been supported by the EU project Oprecomp-Open Transprecision Computing (grant agreement 732631). A. Sîrbu has been partially funded by the EU project SoBigData Research Infrastructure: Big Data and Social Mining Ecosystem (grant agreement 654024).

As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates at both the hardware and software levels will increase significantly. Detecting and classifying faults in HPC systems as they occur, and initiating corrective actions before they turn into failures, is therefore essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Enginee...
HPC systems are widely used in industrial, economical, and scientific applications, and many of thes...
Virtual platform frameworks have been extended to allow earlier soft error analysis of more realisti...
A. Netti has been supported by the EU project Oprecomp-Open Transprecision Computing (grant a...
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that t...
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, wit...
The Antarex dataset contains trace data collected from the homonymous experimental HPC system locate...
Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution...
Supercomputers have played an essential role in the progress of science and engineering research. As...
An increasingly larger percentage of computing capacity in today's large high-performance computing s...
Abstract. Large-scale computing platforms provide tremendous capabilities for scientific discovery. ...
Resilience/fault-tolerance has become a key challenge for large-scale parallel...
An HPC system, a system with much more computational power than general computing systems, is a comp...
With the massive adoption of machine learning (ML) applications in HPC domains, the reliability of M...