Fault-tolerance protocols play an important role in today's long-running scientific parallel applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. We present our approach and preliminary results on a new checkpoint/rollback protocol based on a coordinated scheme. The application is described using a dataflow graph, which is an abstract representation of the execution. Thanks to this representation, fault recovery in our protocol requires only a partial restart of the other processes. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared to the class...
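The idea of a partial restart driven by a dataflow graph can be illustrated with a small sketch. This is not the authors' protocol, only one plausible reading of the abstract under stated assumptions: every task executed since the last coordinated checkpoint is a node of the graph, the tasks of the failed process are lost, and a surviving process re-executes a task only if a replayed task consumes its output. The names `tasks_to_replay`, `deps`, `owner`, and the 1-D stencil example are all hypothetical.

```python
from collections import deque


def tasks_to_replay(deps, owner, failed):
    """Return the set of tasks to re-execute after process `failed` crashes.

    deps   : dict mapping a task to the tasks whose output it consumes
             (assumed to cover only tasks run since the last checkpoint)
    owner  : dict mapping a task to the process that executed it
    failed : identifier of the crashed process
    """
    # Assumption: every task of the failed process since the checkpoint is lost.
    lost = [task for task, proc in owner.items() if proc == failed]
    replay = set(lost)
    queue = deque(lost)
    # Walk dependency edges backwards: a producer must be replayed
    # whenever a replayed task needs its output again.
    while queue:
        task = queue.popleft()
        for producer in deps.get(task, []):
            if producer not in replay:
                replay.add(producer)
                queue.append(producer)
    return replay


# Hypothetical 1-D domain decomposition: task (p, k) is iteration k on
# process p and reads the previous iteration of p and of its neighbours.
procs, steps = 4, 2
owner = {(p, k): p for p in range(procs) for k in range(steps)}
deps = {
    (p, k): [(q, k - 1) for q in (p - 1, p, p + 1) if 0 <= q < procs]
    for (p, k) in owner
    if k > 0
}

replay = tasks_to_replay(deps, owner, failed=0)
involved = {owner[task] for task in replay}
print(len(replay), "of", len(owner), "tasks;",
      len(involved), "of", procs, "processes")
# prints "3 of 8 tasks; 2 of 4 processes"
```

In this toy setting a global rollback would re-execute all 8 tasks on all 4 processes, whereas the backward traversal replays 3 tasks on 2 processes, which is the kind of reduction the abstract reports.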
Checkpointing is a very well known mechanism to achieve fault tolerance. In distributed applications...
Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typ...
In this paper a new checkpoint/recovery protocol called theft-induced checkpoin...
Processor failures in post-petascale parallel computing platforms are common o...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
Failure-free execution will become rare in future exascale computers. Thus...
Index-based checkpointing allows the use of simple and efficient algorithms for domino-eff...
Real-world graph processing applications often require combining the graph data with tabular data. M...
In this paper, we present a unified model for several well-known checkpoint/re...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
The execution times of large-scale parallel applications on modern multi/many-core systems a...