We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ supports custom workloads and allows anomalous conditions to be generated through fault-triggering executable programs. FINJ can also be integrated seamlessly with most lower-level fault injection tools, allowing users to create and monitor a variety of highly complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, making it a very versatile design and evaluation tool.
Computer systems, even when correctly designed, can suffer from temporary errors due to radiatio...
Our society is faced with an increasing dependence on computing systems, not only in high tech consu...
Hardware fault injection is the widely accepted approach to evaluate the behavior of a circuit in t...
A. Netti has been supported by the EU project Oprecomp-Open Transprecision Computing (grant a...
Resilience/fault-tolerance has become a key challenge for large-scale parallel...
Supercomputers have played an essential role in the progress of science and engineering research. As...
The Antarex dataset contains trace data collected from the homonymous experimental HPC system locate...
Large-scale computing platforms provide tremendous capabilities for scientific discovery. ...
Dependability evaluation involves the study of failures and errors. The destructive nature of a cras...
This thesis deals with techniques for designing and evaluating error detection and recovery mecha...
The paper introduces FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to c...
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Com...
This paper describes FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to c...