With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, making a restart from scratch infeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead on the application due to checkpointing, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, ...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
Future high performance computing systems will need to use novel techniques to...
The ever-increasing number of computation units assembled in current HPC platf...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
This work provides an optimal checkpointing strategy to protect iterative appl...
This paper revisits replication coupled with checkpointing for fail-stop error...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
In this paper, we aim at optimizing fault-tolerance techniques based on a ch...
Checkpointing is a typical approach to tolerate failures in today’s supercomputing cluste...
This work provides an analysis of checkpointing strategies for minimizing expe...
By leveraging the enormous amount of computational capabilities, scientists today are able to ...
Computational power demand for large challenging problems has increasingly driven the physical size ...