Exascale platforms require support for resilience capabilities due to increasing numbers of components and associated error rates. In this paper, we present a new directive-based approach to perform application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to OpenMP, that allows users to easily specify the state of the application that has to be saved and restored. This leaves the tedious and error-prone serialization and deserialization activities to our library, which relies on SCR/FTI to perform scalable and efficient I/O operations. Our results, based on several benchmarks and two large applications, reveal no additional overhead compared to the direct use of FTI a...
Abstract. HPC systems are growing in both complexity and size, increasing the opportunity for syste...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
Exascale platforms require programming models incorporating support for resilience capabilities sinc...
One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of ...
Trends in high-performance computing are making it necessary for long-running applications to toler...
Abstract. As failure rate keeps on increasing in large systems, applications running atop restart mor...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and ...
The advent of cluster computing has resulted in a thrust towards providing software mechanisms for r...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
This document has 4 main objectives: (1) Describe data to be saved and restored during checkpoint/re...
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations ...
Abstract. HPC systems are growing in both complexity and size, increasing the opportunity for system...