The probability that errors occur in an electronic system is not known in advance; it depends on many factors, including the environment in which the system operates. This paper demonstrates that inaccurate estimates of the error probability lead to a loss of performance in a well-known fault-tolerance technique, Roll-back Recovery with checkpointing (RRC). To regain the lost performance, a method for estimating the error probability is proposed, together with an adjustment technique. The proposed method is evaluated using a simulator tool developed to enable experimentation, and the results show that it provides useful estimates of the error probability, leading to near-optimal performance of ...
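The sensitivity to a misestimated error probability can be illustrated with a small sketch. It does not reproduce the paper's RRC model or its simulator; it assumes a common textbook model instead: errors arrive as a Poisson process with rate `rate`, an error is detected immediately, recovery itself costs nothing, and the job is split into `n` equal segments, each followed by a checkpoint of cost `ckpt_cost`. Under these assumptions the expected time to complete one segment of length `s` is `(exp(rate * s) - 1) / rate`. All function and parameter names are hypothetical.

```python
import math


def expected_time(n, work, ckpt_cost, rate):
    """Expected completion time for `work` split into n checkpointed segments.

    Assumed model (not the paper's): Poisson errors with the given rate,
    immediate detection, zero-cost rollback to the last checkpoint.
    """
    seg = work / n + ckpt_cost  # useful work plus checkpoint overhead
    return n * math.expm1(rate * seg) / rate


def best_num_checkpoints(work, ckpt_cost, rate, n_max=2000):
    """Brute-force search for the n that minimizes the expected time."""
    return min(
        range(1, n_max + 1),
        key=lambda n: expected_time(n, work, ckpt_cost, rate),
    )


def penalty(work, ckpt_cost, true_rate, assumed_rate):
    """Slowdown factor caused by planning with assumed_rate instead of true_rate."""
    n_assumed = best_num_checkpoints(work, ckpt_cost, assumed_rate)
    n_true = best_num_checkpoints(work, ckpt_cost, true_rate)
    t_assumed = expected_time(n_assumed, work, ckpt_cost, true_rate)
    t_true = expected_time(n_true, work, ckpt_cost, true_rate)
    return t_assumed / t_true


if __name__ == "__main__":
    # Illustrative numbers only: the true error rate is 100x the assumed one.
    print(penalty(work=1000.0, ckpt_cost=5.0, true_rate=0.01, assumed_rate=0.0001))
```

A penalty of 1.0 means the assumed rate leads to the same checkpoint count as the true rate; values above 1.0 quantify the performance lost to the inaccurate estimate under this simplified model.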
Finding the failure rate of a system is a crucial step in high performance computing syst...
Index-based checkpointing allows the use of simple and efficient algorithms for domino-eff...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
For implementing fault-tolerance in multicomputer systems, backward error recovery, based on checkpo...
To combat the increasing soft error rates in recent semiconductor technologies, it is important to e...
For the vast majority of computer systems correct operation is defined as producing the correct resu...
Increasing soft error rates in recent semiconductor technologies enforce the usage of fault toleranc...
Correct operation of real-time systems (RTS) is defined as producing correct results within given ti...
This paper describes a checkpoint comparison and optimistic execution technique for error detection ...
Checkpoint-based rollback recovery is a very popular category of fault tolerance techniques, which ...
In this paper, we consider checkpoint policies that take into account unsuccessful rollback recover...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
A general framework for the design and analysis of distributed fault-tolerant systems is proposed in...