This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the \norestart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) \MM as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period $\pdalyrep = \sqrt{2 \MM \CC}$ à la Young/Daly, where $\CC$ is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the \restart strategy where failed processors are restarted after ea...
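As a minimal illustration of step (ii) of the \norestart strategy, the sketch below evaluates the Young/Daly period $\sqrt{2 \MM \CC}$ for a given MTTI and checkpoint duration. The function name `young_daly_period` and the numeric values are hypothetical and chosen only for illustration. The MTTI is taken as an input, since its expression in terms of the number of processor pairs and the individual MTBF is not given in the abstract.

```python
import math


def young_daly_period(mtti: float, checkpoint_cost: float) -> float:
    """Return sqrt(2 * M * C), the Young/Daly checkpointing period.

    mtti            -- application Mean Time To Interruption M (seconds)
    checkpoint_cost -- checkpoint duration C (seconds)
    Both values are assumed to be already known; this sketch does not
    derive the MTTI from the platform parameters.
    """
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtti and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


if __name__ == "__main__":
    # Hypothetical values: M = 20,000 s and C = 100 s.
    period = young_daly_period(mtti=20_000.0, checkpoint_cost=100.0)
    print(f"checkpointing period: {period:.1f} s")
```

Because the period grows with the square root of the MTTI, a longer MTTI obtained through replication translates directly into a longer checkpointing period, which is the effect the abstract describes.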
Processor failures in post-petascale parallel computing platforms are common o...
Processor failures in post-petascale parallel computing platforms are common o...
In this paper, we design and analyze strategies to replicate the execution of ...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
This paper revisits replication coupled with checkpointing for fail-stop error...
This paper revisits replication coupled with checkpointing for fail-stop error...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables ...
This paper revisits replication coupled with checkpointing for fail-stop error...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables ...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
High performance computing applications must be resilient to faults. The tradi...
High performance computing applications must be resilient to faults. The tradi...