Many real-time applications have strict reliability requirements in addition to their timing requirements. To fulfill these reliability requirements, it may be necessary to use a fault-tolerance strategy. An active replication strategy, where several instances of the task are run in parallel, is the preferred choice for many real-time systems, as the parallel execution of the task instances gives a high probability that at least some of the instances finish successfully before their deadlines, even if others fail. However, running several parallel executions of a single task increases the need for processing power, which is costly and increases the requirements for space and energy. In a passive replication strategy, only...
Designing a distributed fault tolerance algorithm requires careful analysis of both fault models an...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
Task replication has been advocated as a practical solution to reduce response times in parallel sys...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
Process replication is provided as the central mechanism for application level software fault tolera...
It is imperative to accept that failures can and will occur even in meticulously designed distribute...
Traditional active and passive replication schemes are widely used to provide fault tolerant distrib...
Designing a distributed fault tolerance algorithm requires careful analysis of both fault models and...
Processor failures in post-petascale parallel computing platforms are common o...
High performance computing applications must be resilient to faults. The tradi...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
This thesis consists of two parts: performance bounds for scheduling algorithms for parallel program...
Clock-related operations are one of the many sources of replica non-determinism and of replica incon...
Exascale systems of the future are predicted to have mean time between failures (MTBF) of less than ...