Recent advances in deep learning technologies have made GPU clusters popular as training platforms. In this paper, we study reliability issues, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale production GPU cluster. Based on their sources, these failures fall largely into two categories, infrastructure and user, and they reveal diverse root causes. Drawing on insights from this failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to optimize the user experience by reducing failure occurrences.
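To make the categorization concrete, the following is a minimal sketch (not taken from the paper) of how failure records extracted from job logs might be bucketed into "infrastructure" and "user" categories by their source; the keyword rules and sample messages below are hypothetical illustrations, not the paper's actual taxonomy or data.

    # Minimal sketch: bucket failure log lines into infrastructure vs. user failures.
    # Keyword patterns and sample messages are hypothetical, for illustration only.
    import re
    from collections import Counter

    # Failures surfaced by the platform or hardware (hypothetical examples).
    INFRA_PATTERNS = [r"ECC error", r"NVLink", r"node not ready", r"network timeout"]
    # Failures originating from user code or configuration (hypothetical examples).
    USER_PATTERNS = [r"CUDA out of memory", r"ModuleNotFoundError", r"assertion failed"]

    def classify(log_line: str) -> str:
        """Return 'infrastructure', 'user', or 'unknown' for one failure log line."""
        if any(re.search(p, log_line, re.IGNORECASE) for p in INFRA_PATTERNS):
            return "infrastructure"
        if any(re.search(p, log_line, re.IGNORECASE) for p in USER_PATTERNS):
            return "user"
        return "unknown"

    if __name__ == "__main__":
        sample_failures = [
            "Job 1412 aborted: CUDA out of memory on device 3",
            "Job 0987 failed: uncorrectable ECC error detected on GPU 0",
            "Job 2204 failed: ModuleNotFoundError: no module named 'apex'",
        ]
        print(Counter(classify(line) for line in sample_failures))

Such a rule-based pass over failure messages is one plausible way to aggregate failure occurrences by category before deeper analysis.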