As the desire of scientists to perform ever larger computations drives the size of today’s high performance computers from hundreds, to thousands, and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead, especially when applications modify a large amount of memory between checkpoints. This paper presents an algorithm-based checkpoint-free fault tolerance approach in which, instead of taking checkpoints periodically, a coded global consistent state of the critical application data is maintained in memory by modifying applications to operate on encoded data. Altho...
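The idea of operating on encoded data can be illustrated with a minimal sketch. It assumes the simplest possible encoding (one extra checksum block holding the element-wise sum of all data blocks) and a linear update applied identically to every block; the abstract does not specify the encoding actually used, and the names `encode`, `axpy`, and `recover` are hypothetical.

```python
from typing import List

Vector = List[float]


def encode(blocks: List[Vector]) -> Vector:
    """Checksum block: element-wise sum of all data blocks (assumed encoding)."""
    return [sum(column) for column in zip(*blocks)]


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Linear update alpha * x + y, applied to data and checksum blocks alike."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def recover(blocks: List[Vector], checksum: Vector, failed: int) -> Vector:
    """Rebuild the block at index `failed` from the checksum and the survivors."""
    survivors = [b for i, b in enumerate(blocks) if i != failed]
    partial = encode(survivors) if survivors else [0.0] * len(checksum)
    return [c - p for c, p in zip(checksum, partial)]


if __name__ == "__main__":
    # Each inner list stands for the data held by one (simulated) process.
    x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    y = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    cx, cy = encode(x), encode(y)

    # Every process, including the checksum process, applies the same update.
    x = [axpy(2.0, xi, yi) for xi, yi in zip(x, y)]
    cx = axpy(2.0, cx, cy)

    # Simulate losing process 1 and rebuilding its block without a checkpoint.
    lost = x[1]
    x[1] = recover(x, cx, failed=1)
    print(x[1] == lost)
```

Because the update is linear, the checksum of the updated blocks equals the updated checksum, so the relationship survives each step without any periodic checkpoint. A nonlinear update would break this invariant, which is why such schemes are tied to specific algorithms.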
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
As large-scale linear equation systems are pervasive in many scientific fields, great efforts have b...
Several recovery techniques for parallel iterative methods are presented. First, the implementation ...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure ...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
As parallel and distributed systems scale to hundreds of thousands of cores and beyond, fault tolera...
The probability that a failure will occur before the end of the computation increases as the number ...
In this paper, we extend the theory of algorithmic fault-tolerant matrix-matrix multiplication, C =...
Checkpoint and recovery cost imposed by checkpoint/restart (CP/R) is a crucial performance issue for...