Future exascale systems are predicted to have a mean time between failures (MTBF) of less than one hour. At such low MTBFs, periodic checkpointing alone will yield low efficiency, because frequent application failures cause a large amount of work to be lost to rollbacks. In such scenarios, proactive fault tolerance mechanisms that help avoid a significant number of failures become necessary. In this work, we have developed a mechanism for proactive fault tolerance using partial replication of a set of application processes. Our fault tolerance framework periodically adapts the set of replicated processes based on failure predictions in order to avoid failures. We have developed an MPI pro...
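The efficiency loss of checkpointing alone at low MTBF can be illustrated with a rough calculation. The sketch below is not taken from this work: it assumes Young's first-order approximation for the checkpoint interval, ignores restart cost, and uses a hypothetical checkpoint cost of 300 seconds. The function names are illustrative only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpoint interval, sqrt(2 * C * M).

    Assumption: checkpoint cost C is small relative to the MTBF M.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time not spent on useful work.

    C / T is the overhead of writing checkpoints; T / (2 * M) is the
    expected rework, since a failure loses half an interval on average.
    Restart and downtime costs are ignored in this sketch.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


def efficiency(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Estimated efficiency when checkpointing at Young's interval."""
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    return max(0.0, 1.0 - waste_fraction(interval, checkpoint_cost_s, mtbf_s))


if __name__ == "__main__":
    checkpoint_cost_s = 300.0  # hypothetical value, not from the paper
    for label, mtbf_s in [("24 h", 24 * 3600.0), ("1 h", 3600.0)]:
        print(f"MTBF {label}: efficiency ~ {efficiency(checkpoint_cost_s, mtbf_s):.2f}")
```

Under these assumptions the estimated efficiency drops from roughly 0.9 at a 24-hour MTBF to roughly 0.6 at a one-hour MTBF, which is the kind of loss that motivates avoiding failures proactively rather than only recovering from them.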
Processor failures in post-petascale parallel computing platforms are common o...
As high-end computing machines continue to grow in size, issues such as fault tolerance and...
Exascale systems of the future are predicted to have mean time between failures (MTBF) of less than ...
Exascale systems of the future are predicted to have mean time between node failures (MTBF) of less ...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
High performance computing applications must be resilient to faults. The tradi...
As recent research has demonstrated, it is becoming a necessity for large scale applicatio...
Application robustness becomes a major concern with the continued scaling of high perform...
Processor failures in post-petascale settings are common occurrences. The traditional fault-toleranc...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...