The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long-running applications using a large number of processors, it is essential that fault tolerance be used to prevent a total loss of all finished computations after a failure. While checkpointing has long been used to tolerate failures, it often introduces considerable overhead, especially when applications modify a large amount of memory between checkpoints and the number of processors is large. In this paper, we propose an algorithm-based recovery scheme for the High Performance Linpack benchmark (which modifies a large amount of memory in e...
Algorithm-based fault tolerance (ABFT) attracts renewed interest for its extremely low overhead and ...
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
In high-performance systems, the probability of failure is higher for larger systems. The probabili...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
As the desire of scientists to perform ever larger computations drives the size of today’s high perf...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Dense matrix factorizations like LU, Cholesky and QR are widely used for scientific applications tha...
Modeling and analysis of large scale scientific systems often use linear least squares regr...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
As machines increase in scale, it is predicted that failure rates of supercomputers will correspondi...