International audience: As the computational power of high performance computing (HPC) systems continues to increase through the use of huge numbers of cores or specialized processing units, HPC applications are increasingly prone to faults. In this paper, we present a new class of numerical fault tolerance algorithms to cope with node crashes in parallel distributed environments. This new resilient scheme is designed at the application level and does not require extra resources, i.e., computational units or computing time, when no fault occurs. In the framework of iterative methods for the solution of sparse linear systems, we present numerical algorithms to extract relevant information from available data after a fault, assuming a separate mechanism ensu...
On future extreme scale computers, it is expected that faults will become an increasingly serious pr...
This report presents a method to recover from faults detected by hardware in numerical iterative sol...
This paper continues to develop a fault tolerant extension of the sparse grid combination technique ...
International audience: The advent of extreme scale machines will require the use of parallel resour...
International audience: The advent of extreme scale machines will require the use of parallel resource...
International audience: As the computational power of high performance computing (HPC) systems continu...
International audience: In this talk we will discuss possible numerical remedies to survive data loss...
International audience: The advent of extreme scale machines will require the use of parallel resources...
International audience: In this talk we will discuss possible numerical remedies to survive data loss ...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...
International audience: The advent of extreme scale machines will require the use of parallel resource...
Several recovery techniques for parallel iterative methods are presented. First, the implementation ...
International audience: The solution of large eigenproblems is involved in many scientific and enginee...
Large scale simulations are used in a variety of application areas in science and engineering to hel...