In the past twenty years, there has been a wealth of theoretical research on minimizing the expected running time of a program in the presence of failures by employing checkpointing and rollback recovery. In the same time period, there has been little experimental research to corroborate these results. In this paper, we study the results of three separate projects that monitor failure in workstation networks. Our goals are twofold. The first is to see how these results correlate with the theoretical results, and the second is to assess their impact on strategies for checkpointing long-running computations on workstations and networks of workstations. A surprising result of our work is that although the base assumptions of the theoretical re...
With petascale computers only a year or two away, there is a pressing need to anticipate and compensa...
For implementing fault-tolerance in multicomputer systems, backward error recovery, based on checkpo...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
Performance evaluation of checkpoint rollback recovery strategies for distributed systems is a field...
104 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 1998. A large number of checkpoint-...
Checkpointing is today's common means for dealing with transient failures in supercomputers. However,...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures th...
Cooperative checkpointing, in which the system dynamically skips checkpoints requested by applicati...
This paper describes a checkpoint comparison and optimistic execution technique for error detection ...
It is known that checkpointing and rollback recovery are widely used techniques that allow a distri...
This thesis studies a forward recovery strategy using checkpointing and optimistic execution in para...