High-performance computing (HPC) requires resilience techniques such as checkpointing to tolerate failures in supercomputers. As the number of nodes and the amount of memory in supercomputers keep increasing, the size of checkpoint data also grows dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize checkpointing overhead by writing only the data that has changed since the previous checkpoint. This is typically implemented at the memory-page level, sometimes complemented with hashing algorithms. However, such a technique cannot cope with datasets whose size changes dynamically. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We ...
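The idea of hash-based differential checkpointing described above can be illustrated with a minimal sketch. It is not the implementation or file format presented in this work: the block size, the choice of SHA-256, and the function names `block_hashes`, `diff_checkpoint`, and `apply_diff` are all assumptions made for illustration. The sketch hashes fixed-size blocks, writes only blocks whose hash changed, and treats blocks past the end of the previous checkpoint as changed, which is one simple way to tolerate a dataset that grows.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed granularity, comparable to a memory page


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[off:off + block_size]).digest()
        for off in range(0, len(data), block_size)
    ]


def diff_checkpoint(data: bytes, prev_hashes: list[bytes],
                    block_size: int = BLOCK_SIZE):
    """Return (dirty, new_hashes) for the current state of a dataset.

    `dirty` maps a byte offset to the block contents that must be written.
    Blocks beyond the previous checkpoint are always dirty, so a dataset
    that has grown is handled without any special casing.
    """
    new_hashes = block_hashes(data, block_size)
    dirty = {}
    for i, digest in enumerate(new_hashes):
        if i >= len(prev_hashes) or prev_hashes[i] != digest:
            off = i * block_size
            dirty[off] = data[off:off + block_size]
    return dirty, new_hashes


def apply_diff(base: bytes, dirty: dict[int, bytes], new_size: int) -> bytes:
    """Rebuild the current state from the previous state plus dirty blocks."""
    # Truncate or zero-pad the previous state to the new dataset size.
    buf = bytearray(base[:new_size].ljust(new_size, b"\0"))
    for off, block in dirty.items():
        buf[off:off + len(block)] = block
    return bytes(buf)


if __name__ == "__main__":
    state0 = bytes(3 * BLOCK_SIZE)
    _, hashes0 = diff_checkpoint(state0, [])  # first checkpoint: all blocks dirty

    state1 = bytearray(state0)
    state1[5000] = 1            # modify one byte in the second block
    state1 += b"x" * 100        # grow the dataset by a partial block
    dirty, _ = diff_checkpoint(bytes(state1), hashes0)

    print(sorted(dirty))        # offsets of the blocks that need writing
    assert apply_diff(state0, dirty, len(state1)) == bytes(state1)
```

In this sketch only the modified block and the newly appended block are written; unchanged blocks are skipped at the cost of hashing the full dataset on every checkpoint.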
In this paper we present research on improving the resilience of the execution of scientific softwar...
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, app...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
© 2016 IEEE. HPC systems contain an increasing number of components, decreasing the mean time betwee...
By leveraging the enormous amount of computational capabilities, scientists today are able to ...
Checkpoint/restart is a common technique deployed in the high-performance computing (HPC) ...
The efficient utilization of current supercomputing systems with deep storage hierarchies demands sc...
The high failure rate expected for future supercomputers requires the design o...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
The traditional single-level checkpointing method suffers from significant ov...
As modern supercomputing systems reach the peta-flop performance range, they grow in both ...
Checkpointing is a typical approach to tolerate failures in today’s supercomputing cluste...