We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction to consider the communication pattern of different jobs while placing them on network links. To do so, CASSINI uses an affinity graph that finds a series of time-shift values to adjust the communication phases of a subset of jobs, such that the communication patterns of jobs sharing the same network link are interleaved with each other. Experiments with 13 common ML models on a 24-server testbed demonstrate that compared to the state-of-the-art ML schedulers, CASSINI improves the average and tail completion time of jobs by up to 1.6x and 2.5x, respectively. Moreover, we show that CASSINI reduces the number o...
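To make the time-shift idea concrete, the sketch below is a minimal, hypothetical illustration rather than CASSINI's actual algorithm or its geometric abstraction: each job's traffic on a shared link is modeled as a periodic on/off communication pattern, and a brute-force search picks the shift of one job's communication phase that minimizes overlap with the other's. The period, communication durations, and search granularity are invented for illustration.

```python
# Illustrative sketch only (hypothetical parameters, not CASSINI's implementation):
# model each job's traffic on a shared link as a periodic on/off pattern and search
# for the time shift that minimizes the overlap of the two communication phases.

def overlap_fraction(period, comm_a, comm_b, shift, steps=1000):
    """Fraction of an iteration during which both jobs communicate simultaneously,
    when job B's communication phase is delayed by `shift`."""
    hits = 0
    for i in range(steps):
        t = period * i / steps
        a_on = (t % period) < comm_a            # job A is in its communication phase
        b_on = ((t - shift) % period) < comm_b  # job B is in its (shifted) communication phase
        hits += a_on and b_on
    return hits / steps

def best_shift(period, comm_a, comm_b, candidates=200):
    """Pick the time shift that best interleaves the two communication phases."""
    shifts = [period * i / candidates for i in range(candidates)]
    return min(shifts, key=lambda s: overlap_fraction(period, comm_a, comm_b, s))

if __name__ == "__main__":
    period = 100.0                 # iteration time in ms (hypothetical)
    comm_a, comm_b = 40.0, 35.0    # per-iteration communication durations in ms (hypothetical)
    s = best_shift(period, comm_a, comm_b)
    print(f"time shift {s:.1f} ms, residual overlap "
          f"{overlap_fraction(period, comm_a, comm_b, s):.2%} of the iteration")
```

CASSINI itself derives such shifts jointly for many jobs and links via its affinity graph; this sketch only illustrates the two-job, single-link case the abstract describes.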
Traffic for a typical MapReduce job in a datacenter consists of multiple network flows. Traditionall...
Training large machine learning (ML) models with many variables or parameters can take a long time i...
Slow-running or straggler tasks in distributed processing frameworks [1, 2] can be 6 to 8 times slow...
To reduce the impact of network congestion on big data jobs, cluster management frameworks use vario...
Distributed data-parallel processing systems like MapReduce, Spark, and Flink are popular for analyz...
Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as ...
Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which ex...
The standard scheduler of Hadoop does not consider the characteristics of jobs such as computational...
Systems for running distributed deep learning training on the cloud have recently been developed. An...
The growth in size and computational requirements in training Neural Networks (NN) over the past few...
Deep neural networks (DNNs) have recently yielded strong results on a range of applications. Trainin...
Running MapReduce applications in shared clusters is becoming increasingly compelling to improve the...
Stemming from the growth and increased complexity of computer vision, natural language processing, a...
Machine learning (ML) has become a powerful building block for modern services, scientific endeavors...
In this paper, we utilize a bandwidth-centric job communication model that captures the i...