To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, it can incur interference that causes slowdown. In this article we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts the GPU utilization of heterogeneous DL jobs, extrapolated from the DL model’s computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations ...
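The abstract above describes predicting GPU utilization directly from a model's computation-graph features instead of profiling it online. A minimal sketch of that idea is shown below; the feature names, the toy graph representation as `(op_type, flops)` tuples, and the linear predictor with hand-picked weights are all illustrative assumptions, not the actual Horus model.

```python
# Hypothetical sketch: predict GPU utilization from aggregated
# computation-graph features, in the spirit of the approach above.
# The features, weights, and linear form are assumptions for illustration.

def extract_features(graph):
    """Aggregate per-op-type FLOP counts and an op count from a
    computation graph, represented here as (op_type, flops) tuples."""
    feats = {"conv_flops": 0.0, "matmul_flops": 0.0,
             "other_flops": 0.0, "num_ops": 0}
    for op_type, flops in graph:
        key = f"{op_type}_flops" if op_type in ("conv", "matmul") else "other_flops"
        feats[key] += flops
        feats["num_ops"] += 1
    return feats

def predict_gpu_util(feats, weights, bias=5.0):
    """Linear predictor of GPU utilization (%), clamped to [0, 100]."""
    score = bias + sum(weights[k] * v for k, v in feats.items())
    return max(0.0, min(100.0, score))

# Toy example: a small graph and illustrative (made-up) weights.
graph = [("conv", 2.0e9), ("matmul", 5.0e8), ("relu", 1.0e7)]
weights = {"conv_flops": 2.0e-8, "matmul_flops": 1.5e-8,
           "other_flops": 1.0e-8, "num_ops": 0.5}
feats = extract_features(graph)
util = predict_gpu_util(feats, weights)
```

In practice the predictor would be a model trained on micro-benchmark measurements rather than fixed weights; the point is only that features are computable offline from the graph, so no reserved GPU is needed for profiling.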
Our work seeks to improve and adapt computing systems and machine learning (ML) algorithms to match ...
With widespread advances in machine learning, a number of large enterprises are beginning to incorpo...
Training large neural networks with huge amounts of data using multiple Graphics Processi...
Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - r...
Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as ...
DL has pervaded many areas of computing due to the confluence of the explosive growth of large-scale...
Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud,...
The Deep Learning (DL) paradigm gained remarkable popularity in recent years. DL models are used to ...
Deep learning (DL) training jobs now constitute a large portion of the jobs in the GPU clusters. Fol...
Deep Learning (DL) methods currently address a variety of complex tasks. GPUs significantly accelera...
Serverless computing (FaaS) has been extensively utilized for deep learning (DL) inference due to th...
Recent decades have witnessed the breakthrough of deep learning algorithms, which have been widely u...
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose ...
GPU-based clusters are widely chosen for accelerating a variety of scientific applications in high-e...