As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors to consider in maintaining efficiency and providing improved computational performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is the use of checkpointing. By making use of a multi-cluster simulator, we study the impact of sub-optimal checkpoint intervals on overall application efficiency. By using a model of a 1926 node cluster and workload statistics from Los Alamos National Laboratory to parameterize the simulator, we find that dramatically overestimating the AMTTI has a fairly minor impac...
The large scale of current and next-generation massively parallel processing (MPP) systems presents ...
This work provides an analysis of checkpointing strategies for minimizing expe...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures th...
The massive scale of current and next-generation massively parallel processing (MPP) systems present...
Performance prediction of checkpointing systems in the presence of failures is a well-studied resear...
As high performance computing (HPC) systems grow larger, with increasing numbers of components, fail...
The massive scale of current and next-generation massively parallel processing (MPP) systems present...
Parallel computing systems provide hardware redundancy that helps to achieve low cost fault-toleran...
The large scale of current and next-generation massively parallel processing (MPP) systems presents ...
Parallel computing systems provide hardware redundancy that helps to achieve low cost fault-tolerance...
HPC community projects that future extreme scale systems will be much less stable than curr...