Processor failures in post-petascale parallel computing platforms are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback-recovery, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback-recovery, has recently been advocated. We first derive novel theoretical results for Exponential failure distributions, namely exact values for the Mean Number of Failures To Interruption and the Mean Time To Interruption. We then extend these results to arbitrary failure distributions, obtaining closed-form solutions for Weibull distribution...
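To make the two quantities named in the abstract concrete, here is a minimal sketch that computes them for a simplified model. The model is an assumption, not the paper's derivation: processes are replicated in pairs, failures strike only processors that are still running, each running processor fails independently at a constant rate (Exponential distribution), and the application is interrupted as soon as both replicas of one pair have failed. The function names `mean_failures_to_interruption` and `mean_time_to_interruption` and their parameters are hypothetical.

```python
from fractions import Fraction


def mean_failures_to_interruption(n_pairs: int) -> Fraction:
    """Expected number of failures until some replica pair has lost both members.

    Assumed model (not taken from the paper): every failure strikes a
    uniformly chosen processor among those still running.
    State d = number of pairs that have already lost one replica.
    """
    if n_pairs < 1:
        raise ValueError("n_pairs must be at least 1")
    # With d == n_pairs every pair is degraded, so the next failure interrupts.
    expected = Fraction(1)
    for d in range(n_pairs - 1, -1, -1):
        live = 2 * n_pairs - d  # processors still running in state d
        p_hit_intact = Fraction(2 * (n_pairs - d), live)  # failure degrades a new pair
        expected = 1 + p_hit_intact * expected
    return expected


def mean_time_to_interruption(n_pairs: int, rate: float) -> float:
    """Expected time to interruption under the same assumed model.

    `rate` is the assumed per-processor Exponential failure rate; with
    `live` running processors the next failure arrives after 1 / (live * rate)
    on average.
    """
    if n_pairs < 1 or rate <= 0:
        raise ValueError("need n_pairs >= 1 and rate > 0")
    expected = 1.0 / (n_pairs * rate)  # state d == n_pairs: n_pairs processors left
    for d in range(n_pairs - 1, -1, -1):
        live = 2 * n_pairs - d
        p_hit_intact = 2 * (n_pairs - d) / live
        expected = 1.0 / (live * rate) + p_hit_intact * expected
    return expected
```

For a single pair the sketch gives 2 failures to interruption and a mean time of 3/(2 × rate), which matches the expected maximum of two independent Exponential lifetimes. Other conventions, such as counting failures that strike already-failed processors, would give different values.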
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
This paper revisits replication coupled with checkpointing for fail-stop error...
This work provides an analysis of checkpointing strategies for minimizing expe...
Processor failures in post-petascale parallel computing platforms are common o...
Processor failures in post-petascale settings are common occurrences. The traditional fault-toleranc...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
High performance computing applications must be resilient to faults. The tradi...
High performance computing applications must be resilient to faults. The tradi...
This paper revisits replication coupled with checkpointing for fail-stop error...
High performance computing applications must be resilient to faults. The tradi...
With the increased failure rate expected in future extreme scale supercomputer...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
This paper revisits replication coupled with checkpointing for fail-stop error...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
This paper revisits replication coupled with checkpointing for fail-stop error...
This work provides an analysis of checkpointing strategies for minimizing expe...