Employing fault tolerance often introduces a time overhead, which may cause deadline violations in real-time systems (RTS). For RTS it is therefore important to optimize fault tolerance techniques so that the probability of meeting the deadlines, i.e., the Level of Confidence (LoC), is maximized. Previous studies have focused on evaluating the LoC for equidistant checkpointing; no studies have addressed evaluating the LoC for non-equidistant checkpointing. In this work, we provide an expression to evaluate the LoC for non-equidistant checkpointing, and propose the Clustered Checkpointing method, which distributes a given number of checkpoints with the goal of maximizing the LoC. The results show that the LoC can be i...
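To make the notion of LoC concrete, the following is a minimal sketch of how the probability of meeting a deadline could be computed for an arbitrary (non-equidistant) checkpoint placement. It is not the expression derived in this work. The error model is an assumption made only for illustration: errors arrive at a constant rate, an error is detected at the checkpoint that ends the affected segment, a faulty segment is re-executed in full, and every attempt pays a fixed checkpoint overhead. The function name `level_of_confidence` and all of its parameters are hypothetical.

```python
import math


def level_of_confidence(segments, overhead, deadline, error_rate):
    """Probability of finishing by `deadline` under the ASSUMED model.

    segments   -- execution time between consecutive checkpoints (need not be equal)
    overhead   -- time cost paid at the end of every attempt (assumed constant)
    deadline   -- latest acceptable completion time
    error_rate -- assumed constant error arrival rate (Poisson process)
    """
    dist = {0.0: 1.0}  # elapsed time -> probability of having reached it
    for length in segments:
        attempt = length + overhead
        if attempt <= 0:
            raise ValueError("each attempt must take a positive amount of time")
        # Assumption: a segment of this length is error-free with probability exp(-rate * length).
        p_ok = math.exp(-error_rate * length)
        new_dist = {}
        for elapsed, prob in dist.items():
            k = 1  # number of attempts needed for this segment (geometric)
            while elapsed + k * attempt <= deadline + 1e-12:
                t = round(elapsed + k * attempt, 9)
                p_k = p_ok * (1.0 - p_ok) ** (k - 1)
                new_dist[t] = new_dist.get(t, 0.0) + prob * p_k
                k += 1
        # Paths that already exceed the deadline are dropped here.
        dist = new_dist
    return sum(dist.values())


if __name__ == "__main__":
    # Same total work (10 time units) and same number of checkpoints, different spacing.
    equidistant = [2.5, 2.5, 2.5, 2.5]
    non_equidistant = [4.0, 3.0, 2.0, 1.0]
    for name, plan in (("equidistant", equidistant), ("non-equidistant", non_equidistant)):
        loc = level_of_confidence(plan, overhead=0.1, deadline=13.0, error_rate=0.05)
        print(f"{name}: LoC = {loc:.4f}")
```

Under these assumptions the two placements can be compared for a fixed deadline. Which placement gives the higher LoC depends on the deadline, the overhead, and the error rate, so the sketch is best used to explore that sensitivity rather than to reproduce the results reported here.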
Failures are increasingly threatening the efficiency of HPC systems, and curre...
Resilience has become a critical problem for high performance computing. Checkpointing protocols ar...
Checkpointing is a very well known mechanism to achieve fault tolerance. In distributed applications...
Employing fault tolerance often introduces a time overhead, which may cause a deadline violation in ...
To combat the increasing soft error rates in recent semiconductor technologies, it is important to e...
Correct operation of real-time systems (RTS) is defined as producing correct results within given ti...
For the vast majority of computer systems correct operation is defined as producing the correct resu...
Increasing soft error rates in recent semiconductor technologies enforce the usage of fault toleranc...
The application of checkpointing as a fault-tolerance measure for real-time services (i.e., services...
The application of checkpointing as a fault-tolerance measure for real-time services (i.e., services...
Cooperative checkpointing uses global knowledge of the state and health of the machine to improve pe...
Checkpointing is a fault tolerance technique widely used in various types of computer systems. In ch...
Checkpointing is a very well known mechanism to achieve fault tolerance. In distributed applications...
The probability for errors to occur in electronic systems is not known in advance, but depends on ma...
Checkpointing mechanism is used to tolerate the impact of transient faults by rollback opera...