In this work, we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. Because replicating every application task can be prohibitively expensive in resources, we introduce a programmer-directed selective replication mechanism that provides fault tolerance at reduced cost. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.
Peer Reviewed
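The abstract does not spell out how a replicated task is checked, so the following is only a minimal sketch of one common way selective replication can detect and correct an SDC, assuming duplicate execution from a checkpoint of the task inputs, output comparison, and a third tie-breaking run on mismatch. The names `run_task`, `run_replicated`, and the `replicate` flag are hypothetical and stand in for whatever programmer-facing annotation the actual runtime provides.

```python
import copy
from typing import Any, Callable


def run_replicated(task: Callable[[Any], Any], inputs: Any) -> Any:
    """Run `task` twice from a checkpoint of its inputs and compare outputs.

    Assumption (not from the source): a mismatch is treated as a detected
    SDC and resolved by a third execution plus a majority vote.
    """
    # Checkpoint the inputs so every replica starts from the same state.
    checkpoint = copy.deepcopy(inputs)

    first = task(copy.deepcopy(checkpoint))
    second = task(copy.deepcopy(checkpoint))
    if first == second:
        return first  # replicas agree: no corruption detected

    # Replicas disagree: re-execute from the checkpoint and vote.
    third = task(copy.deepcopy(checkpoint))
    if third == first or third == second:
        return third
    raise RuntimeError("all replicas disagree; the error cannot be corrected")


def run_task(task: Callable[[Any], Any], inputs: Any, replicate: bool) -> Any:
    """Replicate only the tasks the programmer marked (hypothetical flag)."""
    if replicate:
        return run_replicated(task, inputs)
    return task(inputs)  # unprotected task: no overhead, no detection
```

In this sketch, only tasks marked with `replicate=True` pay the cost of the extra executions, which is the trade-off between coverage and overhead that the abstract describes.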
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent...
In this paper we propose a runtime-based selective task replication technique for task-parallel high...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Perfo...
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and sile...
Dumping large amounts of related data simultaneously to local storage devices...
This paper provides a model and an analytical study of replication as a techni...
Proposed exascale systems will present a number of considerable resiliency challenges. In...
As the number of nodes in high-performance computing environments keeps increasing, faults are becom...
Redundant multithreading (RMT) is an effective reliability solution that provides thread-level repli...
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This conce...
As machines increase in scale, it is predicted that failure rates of supercomputers will correspondi...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
This chapter describes a unified framework for the detection and correction of...