In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault-tolerant protocols for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, th...
This thesis focuses on a major problem for the HPC community: resilience. Computing platforms are bi...
In this paper, we present a unified model for several well-known checkpoint/re...