Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, such as checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a ...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
In this paper, we present a unified model for several well-known checkpoint/re...
Large scale applications running on new computing platforms with thousands o...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...
As high-end computing machines continue to grow in size, issues such as fault tolerance and...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
High performance computing applications must be resilient to faults. The tradi...
The current approach to resilience for large high-performance computing (HPC) machines is based on g...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables ...
By leveraging enormous computational capabilities, scientists today are able to ...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousand...