International audience. With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, which makes a restart from scratch infeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead that checkpointing imposes on the application, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, ...
Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalabil...
Fast checkpointing algorithms require distributed access to stable storage. Th...
In this paper, we present a unified model for several well-known checkpoint/re...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
This paper revisits replication coupled with checkpointing for fail-stop error...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
This work provides an optimal checkpointing strategy to protect iterative appl...
In this paper, we aim at optimizing fault-tolerance techniques based on a ch...
This work deals with scheduling and checkpointing strategies to execute scient...
The ever-increasing number of computation units assembled in current HPC platf...
In this paper, we design and analyze strategies to replicate the execution of ...
By leveraging the enormous amount of computational capabilities, scientists today are being able to ...