As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, reduces c...
The contributions of this paper are the following. We describe the implementation of the $C^3$ syst...
Checkpointing tools are typically implemented at two different abstraction levels: at the system ...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
Abstract. As modern supercomputing systems reach the peta-flop performance range, they grow in both ...
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this,...
Abstract. As parallel machines increase their number of processors, so does the failure rate of the gl...
To make progress in the face of failures, long-running parallel applications need to save their ...
Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing f...
In this paper we present compiler-assisted checkpointing, a new technique which uses static program ...
Abstract. As modern supercomputing systems reach the peta-flop performance range, they grow in both...
A new transparent, incremental, concurrent checkpoint mechanism for real-time and interactive applic...
Trends in high-performance computing are making it necessary for long-running applications to toler...