The ever-increasing number of processors in parallel computers is making fault-tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. We then propose a group-based solution that combines coordinated checkpointing and message logging. Experimental results demonstrate that it achieves better performance and scalability than LAM/MPI and MPICH-VCL. To assist group formation, we also propose a method for analyzing the communication behavior of the application. ©2008 IEEE.
In this paper, we present a unified model for several well-known checkpoint/re...
This paper presents a new checkpointing algorithm for systems using reliable communication channels....
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
A long-term trend in high-performance computing is the increasing number of no...
The high failure rate expected for future supercomputers requires the design o...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
We present a unified fault-tolerance framework for task-parallel message-passing applications to mit...
Nowadays, clusters and grids are made of more and more computing nodes. The programming o...
As reported by many recent studies, the mean time between failures of future...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
Fault tolerance is becoming a major concern in HPC systems. The two traditiona...