Deep learning (DL) training jobs now constitute a large portion of the jobs in GPU clusters. Following the success of deep learning in domains such as natural language processing, image classification, and object detection, GPUs have become a new member of computing clusters. For various reasons, GPUs are highly underutilized in production GPU clusters. In this thesis, we design a scheduler that uses co-location to improve GPU utilization in these clusters. Using in-depth profiling of DL jobs, we provide metrics that indicate how compatible different DL jobs are with one another. Using this profiling data, we achieve an almost 2X speedup in makespan with co-location compared to the first-in-first-out basel...
The Deep Learning (DL) paradigm gained remarkable popularity in recent years. DL models are used to ...
Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently applied to a wide range...
Deep learning is an emerging workload in the field of HPC. This powerful method of resolution is abl...
Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as ...
(Peer reviewed) Training large neural networks with huge amount of data using multiple Graphic Processi...
With widespread advances in machine learning, a number of large enterprises are beginning to incorpo...
Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - r...
Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud,...
The rise of deep-learning (DL) has been fuelled by the improvements in accelerators. Due to its uniq...
To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware...
With the widespread use of GPU hardware facilities, more and more distributed machine learning app...
Our work seeks to improve and adapt computing systems and machine learning (ML) algorithms to match ...
DL has pervaded many areas of computing due to the confluence of the explosive growth of large-scale...
Deep Learning (DL) methods currently address a variety of complex tasks. GPUs significantly accelera...