Abstract. As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures in that it allows applications to periodically save their state and restart the computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains by far the most popular approach because of its superior performance. This paper focuses on improving the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis is shown...
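To make the idea of incremental checkpointing concrete, the following is a minimal sketch of saving only the state that changed since the previous checkpoint. It is not the compiler analysis described in this paper: the class `IncrementalCheckpointer`, its methods, and the in-memory `log` standing in for stable storage are all hypothetical, and modified entries are tracked at runtime here purely for illustration, whereas the paper proposes identifying them via compiler analysis.

```python
import copy


class IncrementalCheckpointer:
    """Illustrative only: saves just the entries modified since the last checkpoint."""

    def __init__(self, state):
        self.state = dict(state)
        self.dirty = set(self.state)  # nothing has been saved yet, so everything is dirty
        self.log = []                 # hypothetical stand-in for stable storage

    def write(self, key, value):
        # Every update marks the entry as needing to be saved again.
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self):
        # Save only the dirty entries, then reset the dirty set.
        delta = {k: copy.deepcopy(self.state[k]) for k in self.dirty}
        self.log.append(delta)
        self.dirty.clear()
        return len(delta)

    def restore(self):
        # Replay the increments in order; later ones override earlier ones.
        recovered = {}
        for delta in self.log:
            recovered.update(copy.deepcopy(delta))
        return recovered


ck = IncrementalCheckpointer({"a": 1, "b": 2, "c": 3})
first = ck.checkpoint()    # full save: all three entries
ck.write("b", 20)
second = ck.checkpoint()   # incremental save: only "b"
ck.write("c", 99)          # lost on failure, since it was never checkpointed
recovered = ck.restore()
```

In this sketch the second checkpoint writes one entry instead of three, which is the saving that incremental checkpointing aims for; the update to `c` made after the last checkpoint is not recovered.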
A new transparent, incremental, concurrent checkpoint mechanism for real-time and interactive applic...
With increasing scale and complexity of supercomputing and cloud computing arc...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
To make progress in the face of failures, long-running parallel applications need to save their ...
In this paper we present compiler-assisted checkpointing, a new technique which uses static program ...
The contributions of this paper are the following. We describe the implementation of the $C^3$ syst...
Checkpointing support allows program execution to roll-back to an earlier program point, discarding ...
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this,...
Failures are increasingly threatening the efficiency of HPC systems, and curre...