As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, reduces che...
The contributions of this paper are the following. We describe the implementation of the $C^3$ syst...
The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility ...
A new transparent, incremental, concurrent checkpoint mechanism for real-time and interactive applic...
Despite the increasing popularity of shared-memory systems, there is a lack of tools for ...
With increasing scale and complexity of supercomputing and cloud computing arc...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this,...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
To make progress in the face of failures, long-running parallel applications need to save their ...
© 2016 IEEE. HPC systems contain an increasing number of components, decreasing the mean time betwee...
A checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
In this paper, we present a unified model for several well-known checkpoint/re...