Several recovery techniques for parallel iterative methods are presented. First, the implementation of checkpoints in parallel iterative methods is described and analyzed. Then, a simple checkpoint-free fault-tolerant scheme for parallel iterative methods, the lossy approach, is presented. When one processor fails and all its data is lost, the system is recovered by computing a new approximate solution using the data of the non-failed processors; the iterative method is then restarted with this new vector. The main advantage of the lossy approach over standard checkpoint algorithms is that it does not increase the computational cost of the iterative solver when no failure occurs. Experiments are presented that compare the different techniques.
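The lossy approach described above lends itself to a short illustration. The following is a minimal, hypothetical sketch (plain Python/NumPy, not taken from the paper): the block of the iterate owned by a failed processor is wiped out, a replacement is interpolated from the surviving entries by solving the corresponding local equations, and the iterative method is simply restarted from the repaired vector. The Jacobi solver, the block layout, and the names `jacobi_step` and `lossy_recover` are assumptions made for the example.

```python
# Hypothetical sketch of checkpoint-free "lossy" recovery for an iterative
# solver of A x = b, assuming a block-row distribution of x across processors.
# On failure, the lost block is re-approximated from surviving data and the
# iteration restarts from the repaired vector.
import numpy as np

def jacobi_step(A, b, x):
    """One Jacobi sweep: x <- D^{-1} (b - (A - D) x)."""
    D = np.diag(A)
    return (b - A @ x + D * x) / D

def lossy_recover(A, b, x, lost):
    """Rebuild the lost entries of x from the surviving ones.

    Solves the local block system A[lost, lost] * x[lost] =
    b[lost] - A[lost, kept] * x[kept], i.e. interpolates a new
    approximation for the failed processor's data.
    """
    kept = np.setdiff1d(np.arange(len(b)), lost)
    rhs = b[lost] - A[np.ix_(lost, kept)] @ x[kept]
    x = x.copy()
    x[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x

# Toy diagonally dominant system so Jacobi converges.
rng = np.random.default_rng(0)
n = 12
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = np.zeros(n)

for it in range(200):
    x = jacobi_step(A, b, x)
    if it == 50:                          # simulate the failure of one "processor"
        lost = np.arange(4, 8)            # its block of the iterate is wiped out
        x[lost] = 0.0
        x = lossy_recover(A, b, x, lost)  # repair from surviving data, keep iterating

print("residual:", np.linalg.norm(b - A @ x))
```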
Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analyti...
Checkpoint and recovery cost imposed by checkpoint/restart (CP/R) is a crucial performance issue for...
Advancement in computational speed is nowadays gained by using more processing units rather than fas...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
As the computational power of high performance computing (HPC) systems continu...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...
The solution of large eigenproblems is involved in many scientific and enginee...
Energy increasingly constrains modern computer hardware, yet protecting computations and data agains...
On future extreme scale computers, it is expected that faults will become an increasingly serious pr...
Most predictions of Exascale machines picture billion-way parallelism, encompassing not on...
This report presents a method to recover from faults detected by hardware in numerical iterative sol...
As the desire of scientists to perform ever larger computations drives the size of today’s high perf...
The advent of extreme scale machines will require the use of parallel resource...