Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a common technique for reducing the amount of work lost due to job failures, but its effectiveness under realistic circumstances has not been studied. In this paper, we analyze the system-level performance of periodic application checkpointing using parameters similar to those projected for BlueGene/L systems. Our results reflect simulations on a toroidal interconnect architecture, using a real job log from a machine similar to BlueGene/L and a real failure distribution from a large-scale cluster. Our simulation studies investigate the impact of parameters such as...
In the past twenty years, there has been a wealth of theoretical research on minimizing the expected...
Tenant level checkpointing is a novel fault-tolerance technique proposed in our previous wo...
As high performance computing (HPC) systems grow larger, with increasing numbers of components, fail...
As computational clusters rapidly grow in both size and complexity, system reliability and, in parti...
By leveraging the enormous amount of computational capabilities, scientists today are able to ...
The massive scale of current and next-generation massively parallel processing (MPP) systems present...
The large scale of current and next-generation massively parallel processing (MPP) systems presents ...
HPC community projects that future extreme scale systems will be much less stable than curr...
The massive scale of current and next-generation massively parallel processing (MPP) systems present...
This work provides an analysis of checkpointing strategies for minimizing expe...
An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance...
As high performance computing (HPC) systems grow larger, with increasing numbers of components, fail...
As high performance computing (HPC) systems grow larger, with increasing numbers of components, fail...
As high performance computing (HPC) systems grow larger, with increasing numbers of components, fail...