We examine novel fault tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy and relax the synchronicity requirement through a local-failure local-recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of o...
A key issue confronting petascale and exascale computing is the growth in probability of sof...
HPC community projects that future extreme scale systems will be much less stable than curr...
Several recovery techniques for parallel iterative methods are presented. First, the implementation ...
On future extreme scale computers, it is expected that faults will become an increasingly serious pr...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
This paper continues to develop a fault tolerant extension of the sparse grid combination technique ...
This paper continues to develop a fault tolerant extension of the sparse grid combination technique ...
Large scientific applications deployed on current petascale systems expen...
The GridRPC model is well suited for high performance computing on grids thanks to efficie...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
The running times of large-scale computational science and engineering parallel applications, execut...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
A key issue confronting petascale and exascale computing is the growth in probability of soft and ha...
As high performance platforms (Clusters, Grids, etc.) continue to grow in size...