Because of increasing hardware and software complexity, the running time of many computational science applications is now longer than the mean-time-to-failure of high-performance computing platforms. Therefore, computational science applications need to tolerate hardware failures. In this paper, we focus on the stopping failure model, in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. We present a suitable protocol and show how it can be used with a pr...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
Abstract. HPC systems are growing in both complexity and size, increasing the opportunity for system...
The running times of many computational science applications are much longer than the mean-time-to-f...
Abstract — Nowadays, clusters and grids are made of more and more computing nodes. The programming o...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
Abstract. As recent research has demonstrated, it is becoming a necessity for large scale applicatio...
As reported by many recent studies, the mean time between failures of future...
[Abstract] Execution times of large-scale computational science and engineering parallel application...
A long-term trend in high-performance computing is the increasing number of no...
The running times of large-scale computational science and engineering parallel applications, execut...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
Abstract. This paper examines the topic of writing fault-tolerant MPI applications. We discuss the m...
As the size of high performance clusters multiplies, the probability of system failure grows substa...