This thesis focuses on a major problem for the HPC community: resilience. Computing platforms keep growing in order to reach what we call exascale, i.e. a computing capacity of 10^18 FLOP/s, but they suffer numerous failures. Reducing the execution time and handling errors are two linked problems: for instance, replication (computing redundancy) decreases the number of critical failures but also decreases the number of available resources. In particular, this thesis focuses on several “checkpoint/restart” mechanisms (saving the state of an application to restart from that save when a failure occurs): the first part investigates checkpointing on several levels, the use of additional resources to cope with system latency and che...
This paper revisits replication coupled with checkpointing for fail-stop error...
The parallel computing platforms available today are increasingly larger. Typically the emerging par...
This thesis focuses on a major problem in the context of high-performance computing: ...
This thesis deals with two issues for future Exascale platforms, namely resilience and energy. We ad...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
The scaling up of new parallel and distributed computing platforms raises many...
Processor failures in post-petascale settings are common occurrences. The traditional fault-toleranc...
In this article, we present a unified model for several well-known checkpoint/restart protocols. The...
Technological advances have led large organizations such as companies, ...