Abstract: Checkpointing is a typical approach to tolerating failures in today's supercomputing clusters and computational grids. Checkpoint data can be saved in central stable storage, in processor memory (as in diskless checkpointing), or in local disk space (replacing memory with local disk in diskless checkpointing). Where the checkpoint data is saved has a great impact on the performance of a checkpointing scheme. Fault tolerance schemes with higher efficiency usually save the checkpoint data closer to the processor. However, when failures are handled at the application level, the storage hierarchy of a platform is often not known when the fault tolerance scheme is designed. Therefore, it is often difficult to decide ...
As high performance platforms (Clusters, Grids, etc.) continue to grow in size...
Parallel execution time is expected to decrease as the number of processors in...
Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousand...
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the red...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
By leveraging the enormous amount of computational capabilities, scientists today are able to ...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
As the size of high performance clusters multiplies, the probability of system failure grows substa...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
An alternative to classical fault-tolerant approaches for large-scale clusters...
Scientific workflows are data- and compute-intensive; thus, they may run for days or even weeks...
An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
A checkpoint is defined as a designated place in a program at which normal processing is interrupted s...