The mean time between failures (MTBF) of large supercomputers is decreasing, and future exascale computers are expected to have an MTBF of around 30 minutes. It is therefore urgent to prepare important algorithms for future machines with such a short MTBF. Eigenvalue problems (EVP) and singular value problems (SVP) are common in engineering and scientific research. Solving EVP and SVP numerically involves two-sided matrix factorizations: the Hessenberg reduction, the tridiagonal reduction, and the bidiagonal reduction. These three factorizations are computation-intensive and have long running times, which makes them prone to computer failures. We designed algorithm-based fault tolerant (ABFT) algorithms for the parallel Hessenberg reduc...
As large-scale linear equation systems are pervasive in many scientific fields, great efforts...
As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure ...
Checkpoint and recovery cost imposed by checkpoint/restart (CP/R) is a crucial performance issue for...
Dense matrix factorizations like LU, Cholesky and QR are widely used for scientific applications tha...
This paper presents an algorithm-based fault tolerance method to harden three two-sided matrix facto...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
Emerging high-performance computing platforms, with large component counts and lower power margins, ...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
An important consideration in the design of high performance multiprocessor systems is to ensure the...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...