Iterative methods are commonly used to solve large, sparse linear systems, a fundamental operation in many modern scientific simulations. When large-scale iterative methods run in parallel on a large number of ranks, they must periodically checkpoint their dynamic variables to guard against unavoidable fail-stop errors, which requires fast I/O systems and large storage space. Significantly reducing the checkpointing overhead is therefore critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate ...
This paper investigates the optimal number of processors to execute a parallel job, whose speedup pr...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
Several recovery techniques for parallel iterative methods are presented. First, the implementation ...
Future exascale systems are expected to be characterized by more frequent failures than current peta...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
As modern supercomputing systems reach the peta-flop performance range, they grow in both ...
By leveraging the enormous amount of computational capabilities, scientists today are able to ...
The probability that a failure will occur before the end of the computation increases as the number ...
This work provides an optimal checkpointing strategy to protect iterative appl...