In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size $W$ for a periodic checkpointing strategy where both platforms concurrently try and execute $W$ units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal ...
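The periodic strategy described in this abstract can be explored numerically. The following is only a minimal Monte Carlo sketch under assumptions that are not stated in the abstract: failures are exponentially distributed, a failed platform pays a fixed recovery time and restarts the current pattern from the last checkpoint, and the checkpoint itself is failure-free. All function names and parameters (`time_to_finish`, `pattern_time`, `mean_time_per_unit`, `speeds`, `rates`, `checkpoint`, `recovery`) are hypothetical and chosen for illustration.

```python
import random


def time_to_finish(work, speed, rate, recovery, rng):
    """Time for one platform to complete `work` units of work.

    Assumption: after each failure the platform pays `recovery` and
    restarts the whole pattern from the last checkpoint.
    """
    need = work / speed
    if rate <= 0:
        # Failure-free platform: the pattern always succeeds at first try.
        return need
    elapsed = 0.0
    while True:
        next_failure = rng.expovariate(rate)
        if next_failure >= need:
            return elapsed + need
        elapsed += next_failure + recovery


def pattern_time(work, speeds, rates, checkpoint, recovery, rng):
    """Duration of one pattern: the first platform to finish checkpoints."""
    fastest = min(
        time_to_finish(work, s, r, recovery, rng)
        for s, r in zip(speeds, rates)
    )
    # Assumption: the checkpoint is not hit by failures.
    return fastest + checkpoint


def mean_time_per_unit(work, speeds, rates, checkpoint, recovery,
                       samples=2000, seed=0):
    """Average wall-clock time spent per unit of work for pattern size `work`."""
    rng = random.Random(seed)
    total = sum(
        pattern_time(work, speeds, rates, checkpoint, recovery, rng)
        for _ in range(samples)
    )
    return total / (samples * work)


if __name__ == "__main__":
    # Illustrative values only; they do not come from the paper.
    speeds, rates = (1.0, 2.0), (0.001, 0.002)
    for w in (50, 100, 200, 400, 800):
        cost = mean_time_per_unit(w, speeds, rates, checkpoint=10.0, recovery=5.0)
        print(w, round(cost, 4))
```

Scanning a few candidate values of `work` and keeping the one with the smallest time per unit gives an empirical stand-in for the optimal pattern size $W$; it does not reproduce the first- or second-order approximations mentioned in the abstract.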
This work provides an optimal checkpointing strategy to protect iterative applications from fail-sto...
In this article, we present a unified model for several well-known checkpoint/restart protocols. The...
In high-performance computing environments, input/output (I/O) from various sources often contend for...
The parallel computing platforms available today are increasingly large. Typically the emerging par...
The scaling up of new parallel and distributed computing platforms raises many...