Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (fraction of all errors that are actually detected). The main contribution of this paper is to show which detector(s) to use, and to characterize the optimal computational pattern for the application: how many detectors of each type to use, together with the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to the optimal for a realistic set of evalu...
This thesis focuses on resilience for high-performance applications at very large scale...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
Many methods are available to detect silent errors in high-performance computing (HPC) applications. ...
Abstract: Many methods are available to detect silent errors in high-performance computing (HPC) app...
Many methods are available to detect silent errors in high-performance computi...
Many methods are available to detect silent errors in high-performance computi...
This report describes a unified framework for the detection and correction of silent errors, which co...
This thesis focuses on resilience for high performance applications that execute on large scale plat...
This paper provides a model and an analytical study of replication as a technique to detect and corre...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Many methods are available to detect silent errors in high-performance computing (HPC) applications. ...
Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. ...