Abstract—As hardware devices like processor cores and memory sub-systems based on nano-scale technology nodes become more unreliable, the need for fault-tolerant numerical computing engines, as used in many critical applications with long computation/mission times, is becoming more pronounced. In this paper, we present an Algorithm-based Fault Tolerance (ABFT) scheme for an iterative linear solver engine based on the Conjugate Gradient (CG) method that takes advantage of numerical defect correction. This method is “pay as you go”, meaning that there is practically no runtime overhead unless errors occur and a correction is performed. Our experimental comparison with software-based Triple Modular Redundancy (TMR) clearly shows the runtime ben...
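To make the "pay as you go" idea concrete, the following is a minimal sketch, not the scheme from the paper. It assumes a small dense symmetric positive definite system and detects corruption by comparing the recursively updated CG residual with the true residual b - Ax. When the two drift apart, the residual is recomputed and the search direction is restarted, so the extra solver work arises only when an error is found. The function name, the tolerances, and the `fault` parameter (a hypothetical hook that corrupts one entry of x) are all assumptions made for illustration.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_with_residual_check(A, b, tol=1e-10, max_iter=200,
                           check_every=5, drift_tol=1e-8, fault=None):
    """CG for SPD A with a periodic residual consistency check.

    fault is a hypothetical (iteration, index, delta) tuple used only to
    simulate a silent error in x. Returns (x, number_of_corrections).
    """
    x = [0.0] * len(b)
    r = list(b)            # residual for the initial guess x = 0
    p = list(r)
    rs = dot(r, r)
    corrections = 0
    for k in range(max_iter):
        converged = math.sqrt(rs) <= tol
        if converged or (k > 0 and k % check_every == 0):
            # Compare the recurrence residual with the true residual.
            true_r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
            drift = math.sqrt(sum((t - ri) ** 2 for t, ri in zip(true_r, r)))
            if drift > drift_tol:
                # Error detected: replace the residual and restart CG.
                r, p = true_r, list(true_r)
                rs = dot(r, r)
                corrections += 1
                converged = math.sqrt(rs) <= tol
            if converged:
                break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if fault is not None and k == fault[0]:
            x[fault[1]] += fault[2]   # simulated silent data corruption
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, corrections

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_clean, n_clean = cg_with_residual_check(A, b)
x_fault, n_fault = cg_with_residual_check(A, b, fault=(1, 0, 5.0))
```

In this sketch the check itself costs one extra matrix-vector product every `check_every` iterations. A real ABFT scheme would aim for a cheaper detector, so this only illustrates the detect-then-correct structure.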
Abstract- The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
As late-CMOS process scaling leads to increasingly variable circuits/logic and as most post-CMOS tec...
As we stride toward the exascale era, due to increasing complexity of supercomputers, hard and soft ...
Soft errors are increasing in modern computer systems. These faults can corrupt the results of nume...
In this talk we will discuss possible numerical remedies to survive data loss...
Several recent papers have introduced a periodic verification mechanism to det...
Energy increasingly constrains modern computer hardware, yet protecting computations and data agains...
Several recent papers have introduced a periodic verification mechanism to detect silent errors i...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
Several recent papers have introduced a periodic verification mechanism to det...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direc...
The advent of extreme scale machines will require the use of parallel resource...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
We present a fault model designed to bring out the “worst” in iterative solvers based on mathematica...