Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques for ensuring progress across faults, such as checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluat...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
Large scale applications running on new computing platforms with thousands o...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations ...
As high-end computing machines continue to grow in size, issues such as fault tolerance and...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
High performance computing applications must be resilient to faults. The tradi...
The current approach to resilience for large high-performance computing (HPC) machines is based on g...
By leveraging an enormous amount of computational capability, scientists today are able to ...
This paper revisits replication coupled with checkpointing for fail-stop error...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
Over the last decade, computing systems have turned to large scale parallel platforms composed of thousand...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
The HPC community projects that future extreme scale systems will be much less stable than curr...