To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, it can incur interference that causes slowdown. In this article we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts the GPU utilization of heterogeneous DL jobs, extrapolated from the DL model’s computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations ...
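The abstract above describes predicting GPU utilization directly from a model's computation-graph features instead of profiling it online. A minimal sketch of that idea is shown below; the feature names, the toy graph representation as `(op_type, flops)` tuples, and the linear predictor with hand-picked weights are all illustrative assumptions, not the actual Horus model.

```python
# Hypothetical sketch: predict GPU utilization from aggregated
# computation-graph features, in the spirit of the approach above.
# The features, weights, and linear form are assumptions for illustration.

def extract_features(graph):
    """Aggregate per-op-type FLOP counts and an op count from a
    computation graph, represented here as (op_type, flops) tuples."""
    feats = {"conv_flops": 0.0, "matmul_flops": 0.0,
             "other_flops": 0.0, "num_ops": 0}
    for op_type, flops in graph:
        key = f"{op_type}_flops" if op_type in ("conv", "matmul") else "other_flops"
        feats[key] += flops
        feats["num_ops"] += 1
    return feats

def predict_gpu_util(feats, weights, bias=5.0):
    """Linear predictor of GPU utilization (%), clamped to [0, 100]."""
    score = bias + sum(weights[k] * v for k, v in feats.items())
    return max(0.0, min(100.0, score))

# Toy example: a small graph and illustrative (made-up) weights.
graph = [("conv", 2.0e9), ("matmul", 5.0e8), ("relu", 1.0e7)]
weights = {"conv_flops": 2.0e-8, "matmul_flops": 1.5e-8,
           "other_flops": 1.0e-8, "num_ops": 0.5}
feats = extract_features(graph)
util = predict_gpu_util(feats, weights)
```

In practice the predictor would be a model trained on micro-benchmark measurements rather than fixed weights; the point is only that features are computable offline from the graph, so no reserved GPU is needed for profiling.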
Our work seeks to improve and adapt computing systems and machine learning (ML) algorithms to match ...
With widespread advances in machine learning, a number of large enterprises are beginning to incorpo...
Training large neural networks with huge amounts of data using multiple Graphics Processi...
Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - r...
Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as ...
DL has pervaded many areas of computing due to the confluence of the explosive growth of large-scale...
Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud,...
The Deep Learning (DL) paradigm gained remarkable popularity in recent years. DL models are used to ...
Deep learning (DL) training jobs now constitute a large portion of the jobs in the GPU clusters. Fol...
Deep Learning (DL) methods currently address a variety of complex tasks. GPUs significantly accelera...
Serverless computing (FaaS) has been extensively utilized for deep learning (DL) inference due to th...
Recent decades have witnessed the breakthrough of deep learning algorithms, which have been widely u...
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose ...
GPU-based clusters are widely chosen for accelerating a variety of scientific applications in high-e...