Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which prevents the use of pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with some verification mechanism to guarantee that corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we assess the impact of using partial verification mechanisms in addition to a guaranteed verification. The main objective is to investigate to what extent it is worthwhile to use some low-cost but less...
This chapter describes a unified framework for the detection and correction of...
This paper provides a model and an analytical study of replication as a techni...
Several recent papers have introduced a periodic verification mechanism to detect silent errors in it...
Silent errors, or silent data corruptions, constitute a major threat on very l...
Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. ...
Resilience has become a critical problem for high performance computing. Checkpointing protocols ar...
Errors have become a critical problem for high performance computing. Checkpoi...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
Many methods are available to detect silent errors in high-performance computing (HPC) app...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
We focus on High Performance Computing (HPC) workflows whose dependency graph ...
Many methods are available to detect silent errors in high-performance computi...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...