Fault-tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures, and consequently, higher probability of errors. Among the different software fault tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers, the de-facto standard for large-scale systems. Although there exist several checkpoint/restart implementations for CPUs, only a handful have been proposed for GPUs even though more than 60 supercomputers in the TOP 500 list are heterogeneous CPU-GPU systems. In this work, we propose a scalable application-level checkpoint/restart scheme, called cudaCR for long-running kernels on NVIDIA GPUs. Our ...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...
As modern GPU workloads become larger and more complex, there is an ever-increasing demand for GPU c...
Deep learning technology has enabled the development of increasingly complex safety-related autonomo...
While High Performance Computing (HPC) systems continue to scale in volume of computing elements and...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability...
Please refer to pdf.James Watt ScholarshipEngineering and Physical Sciences Research Council (EPSRC)...
Each new generation of GPUs vastly increases the resources avail-able to GPGPU programs. GPU program...
Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programm...
High computation is a predominant requirement in many applications. In this field, Graphic Processin...
Abstract—Due to many factors such as, high transistor density, high frequency, and low voltage, toda...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Resilience is a continuing concern for extreme-scale scientific applications. Tolerating the ever-i...
Next-generation exascale systems, those capable of performing a quintillion (10{sup 18}) operations ...
This artifact describes the steps to reproduce the results for the CUDA code generation with kernel ...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...
As modern GPU workloads become larger and more complex, there is an ever-increasing demand for GPU c...
Deep learning technology has enabled the development of increasingly complex safety-related autonomo...
While High Performance Computing (HPC) systems continue to scale in volume of computing elements and...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability...
Please refer to pdf.James Watt ScholarshipEngineering and Physical Sciences Research Council (EPSRC)...
Each new generation of GPUs vastly increases the resources avail-able to GPGPU programs. GPU program...
Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programm...
High computation is a predominant requirement in many applications. In this field, Graphic Processin...
Abstract—Due to many factors such as, high transistor density, high frequency, and low voltage, toda...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Resilience is a continuing concern for extreme-scale scientific applications. Tolerating the ever-i...
Next-generation exascale systems, those capable of performing a quintillion (10{sup 18}) operations ...
This artifact describes the steps to reproduce the results for the CUDA code generation with kernel ...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...
As modern GPU workloads become larger and more complex, there is an ever-increasing demand for GPU c...
Deep learning technology has enabled the development of increasingly complex safety-related autonomo...