As modern supercomputing systems reach the petaflop performance range, they grow in both size and complexity, which makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures: it enables applications to periodically save their state and restart computation after a failure. Although many automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, reduces checkpoint s...
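Incremental checkpointing, as described in the abstract above, saves only the state that changed since the previous checkpoint. The abstract obtains this information through a compiler analysis. The minimal sketch below uses a different, swapped-in technique: it detects changes at run time by hashing fixed-size blocks of a state buffer. It is meant only to show what an incremental checkpoint contains and how a restart replays it. The names `IncrementalCheckpointer`, `BLOCK_SIZE`, and `restore` are hypothetical, and the sketch assumes the state buffer keeps a fixed size.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # hypothetical block size in bytes; real systems usually track pages


class IncrementalCheckpointer:
    """Hypothetical sketch: records only the blocks of a fixed-size state buffer
    that changed since the previous checkpoint, found by comparing block hashes."""

    def __init__(self, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.hashes: Dict[int, bytes] = {}  # block index -> digest at last checkpoint

    def checkpoint(self, state: bytes) -> Dict[int, bytes]:
        """Return {block index: contents} for blocks that are new or modified."""
        delta: Dict[int, bytes] = {}
        for offset in range(0, len(state), self.block_size):
            index = offset // self.block_size
            block = state[offset:offset + self.block_size]
            digest = hashlib.sha256(block).digest()
            if self.hashes.get(index) != digest:
                delta[index] = block
                self.hashes[index] = digest
        return delta


def restore(deltas: List[Dict[int, bytes]]) -> bytes:
    """Rebuild the latest state by replaying the first (full) checkpoint and
    every later incremental one, oldest first."""
    blocks: Dict[int, bytes] = {}
    for delta in deltas:
        blocks.update(delta)
    return b"".join(blocks[i] for i in sorted(blocks))


if __name__ == "__main__":
    ckpt = IncrementalCheckpointer()
    first = ckpt.checkpoint(b"aaaabbbbcccc")   # first checkpoint is full: blocks 0, 1, 2
    second = ckpt.checkpoint(b"aaaaXbbbcccc")  # only block 1 changed
    print(sorted(first), sorted(second))       # [0, 1, 2] [1]
    print(restore([first, second]))            # b'aaaaXbbbcccc'
```

The second checkpoint stores one block instead of three. A compiler-based approach aims for the same saving without paying the run-time cost of hashing every block at every checkpoint.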
Checkpointing tools are typically implemented at one of two abstraction levels: at the system ...
With the increasing scale and complexity of supercomputing and cloud computing arc...
The contributions of this paper are the following. We describe the implementation of the $C^3$ syst...
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this,...
To make progress in the face of failures, long-running parallel applications need to save their ...
Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing f...
As parallel machines increase their number of processors, so does the failure rate of the gl...
In this paper, we present compiler-assisted checkpointing, a new technique that uses static program ...
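The snippet above is truncated, so the specific static analysis it uses is not stated. One well-known way static analysis can shrink a checkpoint is live-variable analysis: a variable that is overwritten before it is next read does not need to be saved. The sketch below illustrates that idea only, as an assumption rather than the paper's method. It handles straight-line code; loops and branches would require iterating the same equation to a fixed point over a control-flow graph. The names `Stmt` and `live_before` are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical representation: each statement is (variables it defines, variables it uses).
Stmt = Tuple[FrozenSet[str], FrozenSet[str]]


def live_before(stmts: List[Stmt]) -> Set[str]:
    """Backward liveness over straight-line code: live = (live - defs) | uses.
    Only these variables need to be saved at a checkpoint placed before `stmts`."""
    live: Set[str] = set()
    for defs, uses in reversed(stmts):
        live = (live - defs) | uses
    return live


if __name__ == "__main__":
    program_vars = {"a", "b", "c", "temp"}
    after_checkpoint = [
        (frozenset({"temp"}), frozenset({"a"})),       # temp = a * 2
        (frozenset({"b"}), frozenset({"temp", "c"})),  # b = temp + c
        (frozenset(), frozenset({"b"})),               # print(b)
    ]
    to_save = live_before(after_checkpoint)
    print(sorted(to_save))                 # ['a', 'c']
    print(sorted(program_vars - to_save))  # ['b', 'temp'] can be excluded
```

Here `b` and `temp` are both written before they are read again, so their values at the checkpoint are never used after a restart.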
A new transparent, incremental, concurrent checkpoint mechanism for real-time and interactive applic...
Trends in high-performance computing are making it necessary for long-running applications to toler...
A checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
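The definition above describes the basic application-level pattern: at designated points the program saves its state, and after a failure it resumes from the last saved state instead of starting over. The following minimal sketch assumes a toy computation (summing integers) whose entire state is a step counter and an accumulator. The names `run`, `save_checkpoint`, `load_checkpoint`, `CHECKPOINT_INTERVAL`, and the `fail_at` crash simulation are all hypothetical.

```python
import json
import os
import tempfile
from typing import Optional

CHECKPOINT_INTERVAL = 10  # hypothetical: save state every 10 iterations


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temporary file, then rename it over the old checkpoint
    (atomic on POSIX), so a crash mid-write never leaves a half-written file.
    A production version would also fsync before renaming."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(n_steps: int, path: str, fail_at: Optional[int] = None) -> int:
    """Sum 0..n_steps-1, checkpointing every CHECKPOINT_INTERVAL steps and
    resuming from `path` if a checkpoint exists. `fail_at` simulates a crash."""
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % CHECKPOINT_INTERVAL == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "app.ckpt")
    try:
        run(100, path, fail_at=57)
    except RuntimeError:
        pass
    print(load_checkpoint(path))  # {'step': 50, 'total': 1225}
    print(run(100, path))         # 4950, same as an uninterrupted run
```

The work done between step 50 and the simulated failure at step 57 is lost and recomputed. This trade-off is what the checkpoint interval controls: frequent checkpoints waste less work on failure but cost more I/O during normal execution.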