Because of increasing hardware and software complexity, the running time of many computational science applications now exceeds the mean time to failure of high-performance computing platforms. Therefore, computational science applications need to tolerate hardware failures. In this paper, we focus on the stopping failure model, in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. We present a suitable protocol, and show how it can be used with a precompile...
A long-term trend in high-performance computing is the increasing number of no...
HPC systems are growing in both complexity and size, increasing the opportunity for system...
Ultra-scale computer clusters with high speed interconnects, such as InfiniBand, are being widely de...
Because of increasing hardware and software complexity, the running time of many computational scie...
The running times of many computational science applications are much longer than the mean-time-to-f...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
Execution times of large-scale computational science and engineering parallel application...
As recent research has demonstrated, it is becoming a necessity for large scale applicatio...
Nowadays, clusters and grids are made of more and more computing nodes. The programming o...
As reported by many recent studies, the mean time between failures of future...
The running times of large-scale computational science and engineering parallel applications, execut...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
This paper examines the topic of writing fault-tolerant MPI applications. We discuss the m...
As the size of high performance clusters multiplies, the probability of system failure grows substa...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...