Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each method comes with a cost, a recall (the fraction of all errors that are actually detected; a recall below 1 means false negatives), and a precision (the fraction of true errors among all detected errors; a precision below 1 means false positives). The main contribution of this paper is to characterize the optimal computing pattern for an application: which detector(s) to use, how many detectors of each type to use, and the length of the work segment that precedes each of them. We first prove that detectors with imperfect precision offer limited usefulness. Then we focus on detectors with perfect precision, and we conduct a comprehensive complexi...
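The two detector metrics defined in the abstract can be illustrated with a minimal sketch. The function names and the counts used below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of all actual errors that the detector flags."""
    total_errors = true_positives + false_negatives
    if total_errors == 0:
        raise ValueError("recall is undefined when no error occurred")
    return true_positives / total_errors


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged errors that are real errors."""
    total_detections = true_positives + false_positives
    if total_detections == 0:
        raise ValueError("precision is undefined when nothing was flagged")
    return true_positives / total_detections


# Hypothetical detector: 3 real errors caught, 1 missed, no false alarms.
r = recall(true_positives=3, false_negatives=1)
p = precision(true_positives=3, false_positives=0)
print(r, p)
```

With these made-up counts the detector has perfect precision (it never raises a false alarm) but imperfect recall (it misses one error in four), which is the class of detectors the abstract goes on to study.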
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
In this paper, we combine the traditional checkpointing and rollback recovery ...
This paper investigates the optimal number of processors to execute a parallel...
This chapter describes a unified framework for the detection and correction of...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
Resilient algorithms in high-performance computing are subject to rigorous non-functional constrain...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
Errors have become a critical problem for high performance computing. Checkpoi...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...