This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to fault precursors (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the e...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures th...
Checkpoint/recovery has been studied extensively, and various optimization techniques have been pres...
Nowadays, clusters are widely used to execute scientific applications. These applications a...
This paper describes Berkeley Linux Checkpoint/Restart (BLCR), a Linux kernel module that allows sys...
As high performance computing centers (HPCC) continue to grow in popularity, issues of resource mana...
Checkpointing is a procedure of storing process state to a file, which is later used to re...
As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) improves scientific prod...
The massive scale of current and next-generation massively parallel processing (MPP) systems present...
As the failure rate keeps increasing in large systems, applications running atop restart mor...
Checkpoint/restart is a common technique deployed in the high-performance computing (HPC) ...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
By leveraging enormous computational capabilities, scientists today are able to ...