With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by work stealing of tasks, which is the target of two recent resilience techniques. The first adopts application-level checkpointing and keeps checkpoints consistent after steals. The second combines supervision with steal tracking: it lets parent tasks supervise/restart their children and identifies intact subtasks from distributed history information. These techniques have been designed for different task models. This paper transfers steal tracking to the other task model, thus enabling a comparison. Contributions include the choice of supervisors and the def...
Exascale platforms require programming models incorporating support for resilience capabilities sinc...
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the red...
Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomp...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
Fault tolerance poses a major challenge for future large-scale systems. Active research int...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
High performance computing applications must be resilient to faults. The tradi...
This work deals with scheduling and checkpointing strategies to execute scient...
The current approach to resilience for large high-performance computing (HPC) machines is based on g...
This article revisits checkpointing strategies when workflows composed of mult...
Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalabil...