Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (fraction of all errors that are actually detected). The main contribution of this paper is to show which detector(s) to use, and to characterize the optimal computational pattern for the application: how many detectors of each type to use, together with the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to the optimal for a realistic set of evalu...
This thesis focuses on resilience for high-performance applications at very large scale...
Building an infrastructure for exascale applications requires, in addition to many other key compone...
In this paper, we combine the traditional checkpointing and rollback recovery ...
Many methods are available to detect silent errors in high-performance computing (HPC) applications. ...
Many methods are available to detect silent errors in high-performance computi...
Many methods are available to detect silent errors in high-performance computing (HPC) applications. ...
Many methods are available to detect silent errors in high-performance computi...
Abstract: Many methods are available to detect silent errors in high-performance computing (HPC) app...
This report describes a unified framework for the detection and correction of silent errors, which co...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
This thesis focuses on resilience for high performance applications that execute on large scale plat...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
High-performance computing (HPC) platforms are the ideal solution for running applications...