With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, thus making a restart from scratch infeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead on the application due to checkpointing, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, ...
With the emergence of versatile storage systems, multi-level checkpointing (ML...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
To make progress in the face of failures, long-running parallel applications need to save their ...
As modern supercomputing systems reach the peta-flop performance range, they grow in both ...
A new transparent, incremental, concurrent checkpoint mechanism for real-time and interactive applic...
This paper revisits replication coupled with checkpointing for fail-stop error...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
High-frequency memory checkpointing is an important technique in several application domains, such a...
Checkpointing has been widely adopted in support of fault-tolerance and job migration essential for ...
Checkpointing is a typical approach to tolerate failures in today’s supercomputing cluste...