The use of new-generation computing platforms such as computational grids and desktop grids introduces challenging new problems. In particular, because of the huge number of processors involved, security and fault tolerance are key issues that must be taken into account. Coordinated checkpointing is one of the most popular techniques for dealing with failures on such platforms. Application-directed checkpointing puts a heavy strain on the storage system and on communications. This results in large execution-time overheads that severely impact performance and scalability. This work presents a new model of coordinated checkpoint/restart mechanism for several type...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
The parallel computing platforms available today are increasingly large. Typically the emerging par...
Processor failures in post-petascale settings are common occurrences. The traditional fault-toleranc...
Large-scale applications running on new computing platforms with thousands o...
Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousand...
In this paper, we design and analyze strategies to replicate the execution of ...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
In this paper, we present a unified model for several well-known checkpoint/re...
Computational power demand for large challenging problems has increasingly driven the physical size ...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables ...
By leveraging enormous computational capabilities, scientists today are able to ...
This work provides an analysis of checkpointing strategies for minimizing expe...
This work provides an optimal checkpointing strategy to protect iterative appl...
Processor failures in post-petascale parallel computing platforms are common o...
The scaling up of new parallel and distributed computing platforms raises many...