Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. The detection of those periods is important in order to adjust the system to new conditions. In this paper we present a monitoring system that listens to hardware events across computing nodes and forwards important events to the fault tolerance runtime so it can react to those regime changes. Our evaluation at scale shows several aspects of this dynamic checkpointing scheme, critical to understanding its applicability on producti...
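The regime-aware idea in this abstract can be illustrated with a minimal sketch. It is not the paper's monitoring system: the sliding-window detector, the threshold, and the per-regime MTBF values are all hypothetical, and the checkpoint interval uses Young's first-order approximation (interval = sqrt(2 x checkpoint cost x MTBF)) as a stand-in for whatever policy the fault tolerance runtime actually applies.

```python
import math
from collections import deque


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


class RegimeDetector:
    """Hypothetical detector: flags a 'degraded' regime when at least
    `threshold` failure events fall inside a sliding time window."""

    def __init__(self, window_s, threshold):
        self.window_s = window_s
        self.threshold = threshold
        self.events = deque()

    def observe(self, t):
        """Record a failure event at time t (seconds) and return the regime."""
        self.events.append(t)
        # Drop events that have slid out of the window.
        while self.events and t - self.events[0] > self.window_s:
            self.events.popleft()
        return "degraded" if len(self.events) >= self.threshold else "normal"


def interval_for(regime, checkpoint_cost_s, mtbf_by_regime):
    """Pick the checkpoint interval from an assumed per-regime MTBF table."""
    return young_interval(checkpoint_cost_s, mtbf_by_regime[regime])


if __name__ == "__main__":
    # Assumed values, for illustration only.
    mtbf = {"normal": 10000.0, "degraded": 2500.0}
    detector = RegimeDetector(window_s=100.0, threshold=3)
    for t in (0.0, 500.0, 520.0, 550.0, 1000.0):
        regime = detector.observe(t)
        print(t, regime, interval_for(regime, 50.0, mtbf))
```

In this sketch a burst of three events within 100 seconds shortens the checkpoint interval, and the interval relaxes again once the burst leaves the window.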
As the scale of high performance computing (HPC) continues to grow, application fault resil...
The era of petascale computing brought machines with hundreds of thousands of processors. The next g...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Following the growth of high performance computing (HPC) systems in size and complexity, and the adv...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
Checkpointing is a typical approach to tolerate failures in today’s supercomputing cluste...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
By leveraging the enormous amount of computational capabilities, scientists today are able to ...
Fault-tolerance poses a major challenge for future large-scale systems. Active research int...
An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures th...
This paper presents a scalable, adaptive and time-bounded general approach to assure reliable, real-...
In this paper, we present a unified model for several well-known checkpoint/re...