For checkpointing to be practical, it must impose low overhead on the target application. To reduce this overhead, this paper proposes a probabilistic checkpointing method that uses block encoding to detect the memory areas modified between two consecutive checkpoints. Because modified areas are detected by comparing encoded words, aliasing is possible: a modified block may yield the same encoded word as before and so go undetected. However, this paper shows that the aliasing probability is near zero when an 8-byte encoded word is used. The performance of the proposed technique is both analyzed and measured experimentally. An analytic model that predicts the checkpointing overhead is first constructed. By using this model, the block siz...
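The detection step described in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the encoding function (BLAKE2b truncated to 8 bytes), the block size of 4096 bytes, and the names `encode_blocks` and `modified_blocks` are all assumptions chosen for illustration. The idea shown is only that one 8-byte encoded word is kept per memory block, and a block is treated as modified when its word differs from the one recorded at the previous checkpoint.

```python
import hashlib
from typing import List, Tuple

BLOCK_SIZE = 4096  # hypothetical block size in bytes (not taken from the paper)


def encode_blocks(memory: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Return one 8-byte encoded word per block of `memory`.

    BLAKE2b with an 8-byte digest stands in for the paper's block
    encoding, which is not specified in the abstract.
    """
    return [
        hashlib.blake2b(memory[i:i + block_size], digest_size=8).digest()
        for i in range(0, len(memory), block_size)
    ]


def modified_blocks(
    prev_words: List[bytes], memory: bytes, block_size: int = BLOCK_SIZE
) -> Tuple[List[int], List[bytes]]:
    """Compare current encoded words with those of the previous checkpoint.

    Returns the indices of blocks whose word changed (or that are new),
    together with the new word list to keep for the next checkpoint.
    A modified block whose word happens to be unchanged (aliasing)
    would be missed; that is the probabilistic part of the scheme.
    """
    new_words = encode_blocks(memory, block_size)
    changed = [
        i for i, word in enumerate(new_words)
        if i >= len(prev_words) or word != prev_words[i]
    ]
    return changed, new_words


# Usage: four zeroed blocks, then one byte is written into block 1.
state = bytearray(4 * BLOCK_SIZE)
words = encode_blocks(bytes(state))
state[BLOCK_SIZE + 10] = 0xFF
changed, words = modified_blocks(words, bytes(state))
print(changed)  # only block 1 needs to be saved
```

Only the blocks listed in `changed` would need to be written to the checkpoint. Under the additional assumption that the encoding behaves like a uniform random 64-bit function, a single modified block would alias with probability of about 2^-64, which is consistent with the abstract's claim that aliasing is near zero for 8-byte words.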
Fault-tolerant computer systems are increasingly being used in such applications as e-commerce, bank...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
Fast checkpointing algorithms require distributed access to stable storage. Th...
Checkpointing is a pivotal technique in system research, with applications ranging from crash recove...
In this paper we present compiler-assisted checkpointing, a new technique which uses static program ...
To make progress in the face of failures, long-running parallel applications need to save their ...
High-frequency memory checkpointing is an important technique in several application domains, such a...
The overhead of saving checkpoints to stable storage is the dominant performance cost in checkpointi...
As modern supercomputing systems reach the peta-flop performance range, they grow in both ...
Checkpointing is a common technique for reducing the time to recover from faults in computer systems...
A checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and ...