Abstract: Nowadays, clusters are widely used to execute scientific applications. These applications are often message-passing parallel applications with long execution times. Since the number of nodes in clusters is growing, the probability of a node failure during the execution of an application increases, and the application execution time may exceed the cluster mean time between failures (MTBF). To avoid restarting an application from the beginning, fault-tolerance mechanisms such as checkpoint/restart are needed. Currently, checkpoint/restart mechanisms are either implemented directly in the application source code by application programmers or integrated into communication environments such as MPI or PVM. We propose in this ...
Fault tolerance is becoming a major concern in HPC systems. The two traditiona...
Transparent hypervisor-level checkpoint-restart mechanisms for virtual clusters (VCs) or clusters of...
Cluster federations are very useful for applications like large scale code coupling. Faults may appe...
Nowadays, clusters are widely used to execute scientific applications. These applications are often ...
Abstract: Nowadays, clusters are widely used to execute scientific applications. These applications...
Ultra-scale computer clusters with high speed interconnects, such as InfiniBand, are being widely de...
Ultra-scale computer clusters with high speed interconnects, such as InfiniBand, are being widely d...
Abstract: Checkpointing is a procedure of storing process state to a file, which is later used to re...
In a scientific community that increasingly relies upon High Performance Computing (HPC) for large s...
This paper introduces a novel approach in parallel checkpointing aimed at supporting fault-tolerance...
Also available as an INRIA Research Report 5091: http://www.inria.fr/rrrt/rr-5091.html. A new kind of ...
This paper describes issues in the design and implementation of checkpointing and recovery modules f...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
As the size of high performance clusters multiplies, the probability of system failure grows substa...
The ever increasing number of processors used in parallel computers is making fault tolerance suppor...