cudaCR: An In-kernel Application-level Checkpoint/Restart Scheme for CUDA Applications

Pourghassemi, Behnam

Publication date

January 2017

Publisher

eScholarship, University of California

Abstract

Fault tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures and, consequently, a higher probability of errors. Among the different software fault-tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers and the de facto standard for large-scale systems. Although several checkpoint/restart implementations exist for CPUs, only a handful have been proposed for GPUs, even though more than 60 supercomputers in the TOP500 list are heterogeneous CPU-GPU systems. In this work, we propose a scalable application-level checkpoint/restart scheme, called cudaCR, for long-running kernels on NVIDIA GPUs. Our ...
