GPUs are highly contended resources in shared clusters for deep learning (DL) training. However, our analysis of a real-world trace reveals that a non-negligible number of jobs running on the cluster fail and are blindly retried by the job scheduler. Unfortunately, these failures often repeat and waste GPU resources, limiting effective GPU utilization across the cluster. In this paper, we introduce Sibylla, which predicts whether an observed DL training failure will repeat upon retry. Sibylla employs an RNN-based machine learning model that trains on the stdout and stderr logs of failed jobs and can be continuously updated on new log messages without hand-constructing labels for the new training...
To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware...
The amount of data generated by computing clusters is very large, including nodes resources data or ...
With the increasing complexity and scope of software systems, their dependability is crucial. The a...
Recent advances on deep learning technologies have made GPU clusters popular as training platforms. ...
Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as ...
Deep Learning, specifically Deep Neural Networks (DNNs), is stressing storage systems in new...
Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - r...
With widespread advances in machine learning, a number of large enterprises are beginning to incorpo...
Modern applications, such as smart cities, home automation, and eHealth, demand a new approach to im...
Large high-performance computing systems are built with increasing number of components with more CP...
Cloud failure is one of the critical issues since it can cost millions of dollars to cloud service p...
Deep Learning (DL) methods currently address a variety of complex tasks. GPUs significantly accelera...
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose ...