With the increased failure rate expected in future extreme-scale supercomputers, process replication might become a viable alternative to checkpointing. By default, the workload efficiency of replication is limited to 50% because of the additional resources needed to execute the replicas of the application's processes. In this paper, we introduce intra-parallelization, a solution that avoids replicating all computation by introducing work-sharing between replicas. We show on a representative set of benchmarks that intra-parallelization achieves more than 50% efficiency without compromising fault tolerance.
This paper revisits replication coupled with checkpointing for fail-stop error...
High performance computing applications must be resilient to faults. The tradi...
Processor failures in post-petascale settings are common occurrences. The traditional fault-toleranc...
With the increased failure rate expected in future extreme scale supercomputer...
With the increased failure rate expected in future extreme scale supercomputer...
Processor failures in post-petascale parallel computing platforms are common o...
Replication has recently gained attention in the context of fault tolerance fo...
Processor failures in post-petascale parallel computing platforms are common o...
Replication has recently gained attention in the context of fault tolerance fo...
Abstract. As computational clusters increase in size, their mean-time-to-failure reduces. Typically ...
Abstract—As computational clusters increase in size, their mean-time-to-failure reduces drastically....
Abstract—As computational clusters increase in size, their mean-time-to-failure reduces drastically....
High performance computing applications must be tolerant to faults, which are common occurrences esp...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
High performance computing applications must be resilient to faults. The tradi...
This paper revisits replication coupled with checkpointing for fail-stop error...
High performance computing applications must be resilient to faults. The tradi...
Processor failures in post-petascale settings are common occurrences. The traditional fault-toleranc...