As we approach the era of exascale computing, fault tolerance is of growing importance. The increasing number of cores as well as the increased complexity of modern heterogeneous systems result in a substantial decrease in the expected mean time between failures. Among the different fault tolerance techniques, checkpoint/restart is widely adopted in supercomputing systems. Although many supercomputers in the TOP500 list use GPUs, only a few checkpoint/restart mechanisms support GPUs. In this paper, we extend an application-level checkpoint library, called Fault Tolerance Interface (FTI), to support multi-node/multi-GPU checkpoints. In contrast to previous work, our library includes a memory manager, which upon a checkpoint invocation tracks th...
The efficient utilization of current supercomputing systems with deep storage hierarchies demands sc...
High-performance computing (HPC) requires resilience techniques such as checkpointing in order to to...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
Please refer to pdf. James Watt Scholarship. Engineering and Physical Sciences Research Council (EPSRC)...
The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility ...
Heterogeneous systems have increased their popularity in recent years due to the high per...
The HPC community projects that future extreme-scale systems will be much less stable than curr...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
The traditional single-level checkpointing method suffers from significant ov...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations ...
By leveraging the enormous amount of computational capabilities, scientists today are able to ...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...
Checkpointing is a typical approach to tolerate failures in today’s supercomputing cluste...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...