This thesis focuses on resilience for high-performance applications that execute on large-scale platforms with millions of processing cores. On such platforms, errors are the norm rather than the exception. We consider two types of errors: fail-stop errors, which generally cause the application to stop, and silent errors, also known as silent data corruptions (SDCs), which can corrupt data in memory. Silent errors pose a new threat to scientific applications because they are difficult both to detect and to correct. In this thesis, we first study several detection mechanisms for silent errors. We model the impact of such detectors on the execution of scientific applications, which allows us to decide which one to use when multiple choices are ava...
This paper provides a model and an analytical study of replication as a technique to detect and corre...
This thesis deals with two issues for future Exascale platforms, namely resilience and energy. We ad...
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direc...
This thesis is concerned with resilience for high-performance applications at very large scal...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
In this paper, we combine the traditional checkpointing and rollback recovery ...
This report describes a unified framework for the detection and correction of silent errors, which co...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
Resilient algorithms in high-performance computing are subject to rigorous non-functional constrain...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
This thesis focuses on two major problems in the high-performance computing context: resilien...