Fault-tolerant distributed applications require mechanisms to recover data lost in a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and having the surviving processes reload the lost data. We present an algorithmic framework and its C++ library implementation ReStore for MPI programs that enables recovery of data after process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the applicati...
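The idea of keeping replicated data in memory so that survivors can reload it can be illustrated with a minimal sketch. The placement rule below (shifted copies at a fixed stride) and the function names `replica_ranks` and `surviving_holder` are assumptions made for illustration only. They are not the ReStore API, and they are not necessarily the distribution the paper uses.

```cpp
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical placement rule (an assumption, not ReStore's scheme):
// replica k of a block is held by rank (primary + k * stride) % num_ranks,
// where primary = block % num_ranks and stride = num_ranks / replicas.
// Requires 1 <= replicas <= num_ranks, so the ranks returned are distinct.
std::vector<int> replica_ranks(std::size_t block, int num_ranks, int replicas) {
    std::vector<int> ranks;
    const int stride = num_ranks / replicas;
    const int primary = static_cast<int>(block % static_cast<std::size_t>(num_ranks));
    for (int k = 0; k < replicas; ++k) {
        ranks.push_back((primary + k * stride) % num_ranks);
    }
    return ranks;
}

// Returns the first non-failed rank that holds a copy of the block,
// or std::nullopt if every replica of the block was lost.
std::optional<int> surviving_holder(std::size_t block, int num_ranks, int replicas,
                                    const std::set<int>& failed) {
    for (int rank : replica_ranks(block, num_ranks, replicas)) {
        if (failed.count(rank) == 0) {
            return rank;
        }
    }
    return std::nullopt;
}
```

With 8 ranks and 2 replicas, block 5 is held by ranks 5 and 1 under this rule. If rank 5 fails, `surviving_holder` reports rank 1 as the source for reloading. If both fail, the block cannot be recovered from memory.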
Execution times of large-scale computational science and engineering parallel application...
Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different ...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
The running times of many computational science applications are much longer than the mean-time-to-f...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
Most predictions of Exascale machines picture billion way parallelism, encompassing not on...
In this paper we present a recovery-conscious framework for improving the fault resiliency and recov...
Scientific applications have long embraced the MPI as the environment of choice to execute on large ...
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Com...
Petaflops systems will have tens to hundreds of thousands of compute nodes which increases the likel...
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory mu...
High performance computing applications must be resilient to faults. The tradi...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...