Fault tolerance protocols play an important role in today's long-running scientific parallel applications, because the probability of failure can be significant given the number of unreliable components involved in a simulation. In this paper we present our approach and preliminary results for a new checkpoint/recovery protocol based on a coordinated scheme. This protocol is tightly coupled to the availability of an abstract representation of the execution
In order to provide fault tolerance for distributed systems, the checkpointing technique has widely ...
Due to the character of the original source materials and the nature of batch digitization, quality ...
We consider the problem of bringing a distributed system to a consistent state after transient fail...
Fault tolerance protocols play an important role in today long runtime scienti...
Fault-tolerance protocols play an important role in today long runtime scientific parallel appli...
Fault-tolerance protocols play an important role in today long runtime scienti...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
Index-based checkpointing allows the use of simple and efficient algorithms for domino-eff...
The execution times of large-scale parallel applications on modern multi/many-core systems a...
In this paper a new checkpoint/recovery protocol called theft-induced checkpoin...
This thesis studies a forward recovery strategy using checkpointing and optimistic execution in para...
This paper presents a checkpointing-recovery scheme for Time Warp parallel simulation. The scheme re...
As High Performance platforms (Clusters, Grids, etc.) continue to grow in size...