Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (the fraction of all errors that it actually detects). The main contribution of this paper is to characterize the optimal computational pattern for an application: which detectors to use, how many detectors of each type to use, and the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to the optimal for a rea...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...
Graduation thesis (Master's in Computing), Instituto Tecnológico de Costa Rica, Escuela de Comput...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
This chapter describes a unified framework for the detection and correction of...
Errors have become a critical problem for high performance computing. Checkpoi...
Resilient algorithms in high-performance computing are subject to rigorous non-functional constrain...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
In this paper, we combine the traditional checkpointing and rollback recovery ...