This paper compares several fault-tolerance methods for the detection and correction of floating-point errors in matrix-matrix multiplication. These methods include replication, triplication, Algorithm-Based Fault Tolerance (ABFT) and residual checking (RC). Error correction for ABFT can be achieved either by solving a small linear system of equations or by recomputing the corrupted coefficients. We show that both approaches can also be used for RC. We provide a synthetic presentation of all methods before discussing their pros and cons. We have implemented all these methods with calls to optimized BLAS routines, and we provide performance data for a wide range of failure rates and matrix sizes.
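To make the two checksum-based methods named in the abstract concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes the classical full-checksum encoding for ABFT (an extra row of column sums on A, an extra column of row sums on B) and the all-ones vector for RC; the function names, the `tol` threshold, and the simulated fault are all illustrative assumptions. Only correction by recomputation is shown; the linear-system variant is omitted. The toy `matmul` stands in for an optimized BLAS GEMM call.

```python
def matmul(A, B):
    """Triple-loop product; a stand-in for an optimized BLAS GEMM call."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def abft_product(A, B):
    """ABFT: multiply checksum-encoded operands, giving an (n+1) x (m+1) result."""
    A_c = A + [[sum(col) for col in zip(*A)]]   # extra row: column sums of A
    B_r = [row + [sum(row)] for row in B]       # extra column: row sums of B
    return matmul(A_c, B_r)

def abft_locate(C_f, tol=1e-9):
    """Flag data rows/columns whose sums disagree with the checksums stored in C_f."""
    n, m = len(C_f) - 1, len(C_f[0]) - 1
    bad_rows = [i for i in range(n) if abs(sum(C_f[i][:m]) - C_f[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(C_f[i][j] for i in range(n)) - C_f[n][j]) > tol]
    return bad_rows, bad_cols

def residual_locate(A, B, C, tol=1e-9):
    """RC with the all-ones vector e: compare C e with A (B e), and e^T C with (e^T A) B."""
    Be = [sum(row) for row in B]
    eA = [sum(col) for col in zip(*A)]
    row_ref = [sum(a * b for a, b in zip(row, Be)) for row in A]        # A (B e)
    col_ref = [sum(a * b for a, b in zip(eA, col)) for col in zip(*B)]  # (e^T A) B
    bad_rows = [i for i, row in enumerate(C) if abs(sum(row) - row_ref[i]) > tol]
    bad_cols = [j for j, col in enumerate(zip(*C)) if abs(sum(col) - col_ref[j]) > tol]
    return bad_rows, bad_cols

def recompute(A, B, C, bad_rows, bad_cols):
    """Correct C in place by recomputing coefficients at flagged row/column intersections."""
    cols = list(zip(*B))
    for i in bad_rows:
        for j in bad_cols:
            C[i][j] = sum(a * b for a, b in zip(A[i], cols[j]))

# Integer-valued entries keep the arithmetic exact, so tol plays no role here.
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[2.0, 0.0, 1.0], [1.0, 3.0, 0.0], [0.0, 1.0, 4.0]]
reference = matmul(A, B)

# ABFT: checksums are produced by the product itself.
C_f = abft_product(A, B)
C_f[1][2] += 5.0                      # simulated soft error in one data coefficient
abft_hits = abft_locate(C_f)
recompute(A, B, C_f, *abft_hits)
abft_data = [row[:3] for row in C_f[:3]]

# RC: the plain product is checked after the fact.
C = matmul(A, B)
C[0][1] -= 3.0                        # simulated soft error
rc_hits = residual_locate(A, B, C)
recompute(A, B, C, *rc_hits)
```

With general floating-point data the comparisons cannot use a fixed tiny `tol`: rounding errors make the checksums differ slightly even without faults, so a threshold scaled to the data would be needed. Both checks cost quadratic work on top of the cubic product, which is what makes them cheaper than replication or triplication.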
Reliable floating-point arithmetic is vital for dependable computing systems. It is also important f...
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
A fault-tolerant array for matrix multiplication that explicitly incorporates mechanisms for easy te...
In this paper, we extend the theory of algorithmic fault-tolerant matrix-matrix multiplication, C =...
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
This paper presents an algorithm based fault tolerance method to harden three two-sided matrix facto...
Due to non-associativity of floating-point operations and dynamic schedu...
In Algorithm-based fault tolerance (ABFT), fault tolerance is tailored to the algorithm performed. M...
An important consideration in the design of high performance multiprocessor systems is to ensure the...
Dense matrix factorizations like LU, Cholesky and QR are widely used for scientific applications tha...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
We present a new approach to fault tolerance for High Performance Computing system. Our approach is ...