Current approaches to checkpointing and recovery assume system homogeneity: both are performed on the same processor architecture and operating system configuration. Sometimes it is desirable or necessary to recover the failed computation on a different processor architecture, possibly with different byte-ordering and data-alignment specifications. Checkpointing and recovery must therefore be portable. We provide portability by means of a universal checkpoint format that allows object code to resume execution from a checkpointed state, so that already compiled code runs at full speed rather than being interpreted or compiled on the fly. This paper describes the system support needed to implement ...
In this work, we present a high performance recovery algorithm for distributed systems in which chec...
This paper describes a checkpoint comparison and optimistic execution technique for error detection ...
We consider the problem of bringing a distributed system to a consistent state after transient fail...
Checkpointing in a homogeneous environment, where both checkpointing and recovery are performed on t...
Traditional checkpoint and recovery are based upon two basic assumptions. The first is the need to h...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
Checkpointing is a pivotal technique in system research, with applications ranging from crash recove...
This thesis studies a forward recovery strategy using checkpointing and optimistic execution in para...
This work proposes some generic approaches to offer transparency and efficiency to checkpoint-recove...
To make progress in the face of failures, long-running parallel applications need to save their ...
A checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
We propose a generalized forward recovery checkpointing scheme, with lookahead execution a...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
To provide fault tolerance to computer systems suffering from transient faults, checkpointing and ro...