In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. Because complete replication of all application tasks can be prohibitive in resource cost, we introduce a programmer-directed selective replication mechanism that provides fault tolerance at lower cost. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.
Peer Reviewed. Postprint (published version)
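As a rough illustration of the idea, selective replication with checkpointing might be sketched as follows. This is a minimal sketch under stated assumptions, not the mechanism evaluated in the paper: the abstract does not say how many replicas are run or how outputs are compared, so the sketch assumes duplicate execution with an equality check and a third execution from the checkpoint to break ties. The names `run_replicated`, `run_task`, and the `replicate` flag are hypothetical.

```python
import copy


def run_replicated(task, inputs):
    """Run `task` twice on private copies of `inputs` and compare results.

    Assumption: a mismatch between replicas signals a silent data corruption.
    On mismatch, the task is re-executed from the checkpointed inputs and the
    result that two of the three executions agree on is returned.
    """
    checkpoint = copy.deepcopy(inputs)  # checkpoint of the task inputs
    first = task(copy.deepcopy(checkpoint))
    second = task(copy.deepcopy(checkpoint))
    if first == second:
        return first  # replicas agree, no error detected

    # Replicas disagree: restore from the checkpoint and run a third time.
    third = task(copy.deepcopy(checkpoint))
    if third == first or third == second:
        return third  # majority result
    raise RuntimeError("all replicas disagree; cannot recover")


def run_task(task, inputs, replicate):
    """Hypothetical programmer-directed switch: replicate only marked tasks."""
    if replicate:
        return run_replicated(task, inputs)
    return task(inputs)
```

In this sketch the programmer pays the replication cost only for tasks marked with `replicate=True`, which is the trade-off between coverage and overhead that the abstract describes.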
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...
In this paper we propose a runtime-based selective task replication technique for task-parallel high...
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Perfo...
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and sile...
Dumping large amounts of related data simultaneously to local storage devices...
This paper provides a model and an analytical study of replication as a techni...
Proposed exascale systems will present a number of considerable resiliency challenges. In...
As the number of nodes in high-performance computing environments keeps increasing, faults are becom...
Redundant multithreading (RMT) is an effective reliability solution that provides thread-level repli...
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This conce...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
As machines increase in scale, it is predicted that failure rates of supercomputers will correspondi...
This chapter describes a unified framework for the detection and correction of...