One of the major challenges in using extreme-scale systems efficiently is mitigating the impact of faults. Application-level checkpoint/restart (CR) methods provide the best trade-off between productivity, robustness, and performance. Many solutions implement CR at the application level, and all of them provide advanced I/O capabilities to minimize the overhead that CR introduces. Nevertheless, there is still room for improvement in programmability and flexibility, because end users must manually serialize and deserialize application state using low-level APIs, modify the application's control flow to handle restarts, or rewrite CR code whenever the backend library changes. In this work, we propose a set of compiler directives ...
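To make the programmability burden concrete, the following is a minimal sketch of hand-written application-level CR for a toy iterative loop. It is not the directive-based approach proposed in this work, nor the API of any particular CR library. The names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter are hypothetical, and Python's standard `pickle` module stands in for a low-level serialization backend. The sketch shows the three manual tasks the abstract names: serializing state, deserializing it, and altering the control flow so the loop resumes from the last checkpoint.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Manually serialize state: write a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic, so a fault never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Manually deserialize state; return None when no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Toy iterative computation (hypothetical); fail_at simulates a fault."""
    # Restart logic the developer has to weave into the application flow.
    state = load_checkpoint(path)
    if state is None:
        state = {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` a second time with the same checkpoint path after a simulated fault resumes from the last saved step instead of from zero. Every line of the save, load, and resume logic is tied to the chosen backend, which is the coupling the abstract identifies as a maintenance cost when the backend library changes.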
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
Exascale platforms require support for resilience capabilities due to increasing numbers of componen...
Exascale platforms require programming models incorporating support for resilience capabilities sin...
Heterogeneous systems have increased their popularity in recent years due to the high per...
HPC systems are growing in both complexity and size, increasing the opportunity for system...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and ...
HPC systems are growing in both complexity and size, increasing the opportunity for syste...
As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) improves scientific prod...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
The advent of cluster computing has resulted in a thrust towards providing software mechanisms for r...
Trends in high-performance computing are making it necessary for long-running applications to toler...
The efficient utilization of current supercomputing systems with deep storage hierarchies demands sc...
The Partitioned Global Address Space (PGAS) has emerged recently for parallel programming at large s...