Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. Long-running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper, we describe a new coordinated checkpoint protocol capable of satisfying both types of applications. The protocol uses time to avoid all types of direct coordination (e.g., message exchanges and message tagging), reducing the overheads to almost a minimum. To ensure that rapid recoveries can be attained, the protocol guarantees small checkpoint latencies. ...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
104 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 1998. A large number of checkpoint-...
In this paper, we describe an efficient coordinated-checkpointing and recovery algorithm which can w...
This paper presents a new checkpointing algorithm for systems using reliable communication channels....
The ever-increasing number of processors used in parallel computers is making fault tolerance suppor...
This paper presents a new checkpointing coordination scheme which utilizes the communication pattern...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
A long-term trend in high-performance computing is the increasing number of no...
In order to provide fault tolerance for distributed systems, the checkpointing technique has widely ...