With the evolution of high-performance computing towards heterogeneous, massively parallel systems, parallel applications have developed new fault tolerance needs. Checkpointing has become a widely used technique to obtain fault tolerance. Whether due to a failure in the execution or to a migration of the processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data needs to be stored, where to store it, or where to checkpoint. CPPC (Controller/Precompiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and t...
Current approaches for checkpointing and recovery assume system homogeneity, where checkpointing and...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and ...
Trends in high-performance computing are making it necessary for long-running applications to toler...
This is the peer reviewed version of the following article: Rodríguez, G., Martín, M. J., González,...
compiler for Portable Checkpointing), a checkpointing tool designed for heterogeneous clusters and G...
Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing f...
This paper focuses on the performance evaluation of Compiler for Portable Checkpointing (CPPC), a to...
This is a post-peer-review, pre-copyedit version of an article published in The Computer Journal. Th...
This is a post-peer-review, pre-copyedit version of an article published in International Journal of...
Message passing applications on a distributed computer require tools to integrate state saving and r...
The contributions of this paper are the following. • We describe the implementation of the C3 system...
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this,...
Checkpointing tools may be typically implemented at two different abstraction levels: at the system ...
The contributions of this paper are the following. We describe the implementation of the $C^3$ syst...