Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One reason for this unnerving situation is the focus that has been given to per-application completion time rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that c...
This work provides an optimal checkpointing strategy to protect iterative appl...
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in...
Future high performance computing systems will need to use novel techniques to...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
In this paper, we present a unified model for several well-known checkpoint/re...
Processor failures in post-petascale parallel computing platforms are common o...
This paper revisits replication coupled with checkpointing for fail-stop error...
In this paper, we design and analyze strategies to replicate the execution of ...
As reported by many recent studies, the mean time between failures of future...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
High performance computing applications must be resilient to faults. The tradi...
With increasing scale and complexity of supercomputing and cloud computing arc...
The traditional single-level checkpointing method suffers from significant ov...
This thesis focuses on a major problem for the HPC community: resilience. Computing platforms are bi...