The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities), in order to further improve the fault tolerance performance of extreme-scale HPC applications. How to optimize the checkpoint intervals for each level, however, is an extremely difficult problem. In this paper, we construct an easy-to-use two-level checkpoint model. Checkpoint level 1 deals with errors with low checkpoint/recovery overheads such as transient memory errors, w...
This paper investigates the optimal number of processors to execute a parallel...
The large scale of current and next-generation massively parallel processing (MPP) systems presents ...
In this paper, we design and analyze strategies to replicate the execution of an application on two ...
HPC community projects that future extreme scale systems will be much less stable than curr...
In this paper, we present a unified model for several well-known checkpoint/re...
Computational power demand for large challenging problems has increasingly driven the physical size ...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
Future high performance computing systems will need to use novel techniques to...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
This work provides an optimal checkpointing strategy to protect iterative appl...
With the emergence of versatile storage systems, multi-level checkpointing (ML...
The high failure rate expected for future supercomputers requires the design o...