Exascale platforms require programming models that support resilience, since the huge number of components they are expected to have will increase the number of errors. Checkpoint/restart is a widely used resilience technique due to its robustness and low overhead compared to other techniques. Several solutions already implement this technique, such as FTI and SCR, which focus mainly on providing advanced I/O capabilities to minimize checkpoint/restart time. However, application developers are still in charge of: (1) manually serializing and deserializing the application state using a low-level API; (2) modifying the natural flow of the application depending on whether the current execution is a restart ...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
With the advent of persistent memory (PM), how to make use of systems that deploy PM is catching int...
This paper describes our experience with the implementation and applications of the Unix checkpointi...
Exascale platforms require programming models incorporating support for resilience capabilities sin...
Exascale platforms require support for resilience capabilities due to increasing numbers of componen...
The advent of cluster computing has resulted in a thrust towards providing software mechanisms for r...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables ...
One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of ...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and ...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
As failure rate keeps on increasing in large systems, applications running atop restart mor...
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations ...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...
Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalabil...