Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this is not as easy as it seems, even for perfectly parallel applications. In particular, processors are subject to faults. The more processors are available, the more likely faults will strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach, and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with prediction and with replication.
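To make the notion of an optimal checkpointing period concrete, the sketch below uses the classical first-order approximation usually attributed to Young, which may differ from the exact derivation in the chapter. It assumes independent, identically distributed processor faults (so the platform MTBF is the individual MTBF divided by the number of processors) and ignores downtime and recovery costs. All function and parameter names are hypothetical and chosen for illustration only.

```python
import math


def platform_mtbf(individual_mtbf: float, processors: int) -> float:
    """Platform MTBF, assuming independent and identical processor faults."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return individual_mtbf / processors


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal checkpointing period: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost: checkpoint overhead plus expected rework.

    Assumption: downtime and recovery time are neglected.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    # Illustrative values only (seconds): they are not taken from the chapter.
    mu = platform_mtbf(individual_mtbf=3_000_000.0, processors=1000)
    period = young_period(checkpoint_cost=60.0, mtbf=mu)
    print(f"platform MTBF: {mu:.0f} s, period: {period:.0f} s")
    print(f"first-order waste: {first_order_waste(period, 60.0, mu):.2f}")
```

Under these assumptions, adding processors shortens the platform MTBF, which in turn shortens the period and increases the fraction of time lost. This is one way to see why execution time does not simply keep decreasing with the processor count.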
Safety-critical applications have to function correctly even in presence of faults. This thesis deal...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
Researchers have mentioned that the three most difficult and growing problems in the future of high-...
This report provides an introduction to the design of scheduling algorithms to cope with faults on l...
Since the last decade, computing systems turn to large scale parallel platforms composed of thousand...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
This work deals with scheduling and checkpointing strategies to execute scient...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
The large scale of current and next-generation massively parallel processing (MPP) systems presents ...
This work provides an analysis of checkpointing strategies for minimizing expe...
High performance computing applications must be resilient to faults. The tradi...
As grids typically consist of autonomously managed subsystems with strongly varying resources, fault...
The massive scale of current and next-generation massively parallel processing (MPP) systems present...