Resilience has become a critical problem for high-performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their particularity is that they are detected only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism that checks whether the application state is correct, so checkpoints should be supplemented with verifications. When a verification succeeds, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We i...
In this paper, we combine the traditional checkpointing and rollback recovery ...
Several recent papers have introduced a periodic verification mechanism to det...
Silent errors, or silent data corruptions, constitute a major threat on very l...
Errors have become a critical problem for high performance computing. Checkpoi...
Resilience has become a critical problem for high performance computing. Checkpointing protocols ar...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...
This chapter describes a unified framework for the detection and correction of...
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
In this paper, we combine the traditional checkpointing and rollback recovery strategies with verifi...
Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. ...