The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. However, as the amount of data handled by current applications grows, these strategies suffer from unreasonable data transfer demands or from the global synchronizations they entail. Meanwhile, the current trend towards task-based programming is an opportunity to revisit the principles of checkpoint/restart strategies. We propose here a checkpointing scheme that is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distrib...
Driven by increasing core count and decreasing mean-time-to-failure in superco...
Current checkpointing techniques employed to overcome faults for HPC applications result in inferior...
Checkpoints that store intermediate results of computation have a fundamental impact on the computin...
By leveraging enormous computational capabilities, scientists today are able to ...
This work deals with scheduling and checkpointing strategies to execute scient...
Abstract: Checkpointing is a procedure of storing process state to a file, which is later used to re...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
Global checkpointing to external storage (e.g., a parallel file system) is a c...
This thesis focuses on a major problem for the HPC community: resilience. Computing platforms are bi...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
As we approach the era of exascale computing, fault tolerance is of growing importance. The increas...
As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) improves scientific prod...
The efficient utilization of current supercomputing systems with deep storage hierarchies demands sc...
Abstract. As modern supercomputing systems reach the peta-flop performance range, they grow in both ...