Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvements because of (1) the difficulty of designing a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes Ok-Topk, a scheme for distributed training with sparse gradients. Ok-Topk integrates a novel sparse allreduce algorithm (with less than 6k communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, Ok-Topk...
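As background for the gradient-sparsification idea the abstract refers to, the following is a minimal sketch of generic Top-k sparsification with local error feedback, not the Ok-Topk algorithm itself; the function name topk_sparsify, the residual argument, and the choice of k are illustrative assumptions.

```python
import numpy as np

def topk_sparsify(grad, k, residual):
    """Keep the k largest-magnitude entries of (grad + residual);
    everything else is accumulated locally as the new residual.
    Generic Top-k sparsification with error feedback (illustrative sketch)."""
    acc = grad + residual                         # error feedback: re-add previously dropped mass
    idx = np.argpartition(np.abs(acc), -k)[-k:]   # indices of the k largest magnitudes
    values = acc[idx]                             # only these values are communicated
    new_residual = acc.copy()
    new_residual[idx] = 0.0                       # dropped entries stay on the worker
    return idx, values, new_residual

# Usage: each worker transmits only (idx, values), i.e. O(k) data instead of the
# full O(n) gradient, and a sparse allreduce combines the workers' contributions.
rng = np.random.default_rng(0)
g = rng.standard_normal(1000).astype(np.float32)
res = np.zeros_like(g)
idx, vals, res = topk_sparsify(g, k=10, residual=res)
```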
Compressed communication, in the form of sparsification or quantization of stochastic gradients, is ...
Applying machine learning techniques to the quickly growing data in science and industry requires hi...
We study lossy gradient compression methods to alleviate the communication bottleneck in data-parall...
To train deep learning models faster, distributed training on multiple GPUs is a very popular sche...
Distributed training of massive machine learning models, in particular deep neural networks, via Sto...
In recent years, the rapid development of new generation information technology has resulted in an u...
Stochastic optimization algorithms implemented on distributed computing architectures are increasing...
The distributed training of deep learning models faces two issues: efficiency and privacy. First of ...
Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i....
Distributed deep learning has become very common as a way to reduce the overall training time by exploiting mult...
We consider distributed optimization under communication constraints for training deep learning mode...
Load imbalance pervasively exists in distributed deep learning training systems, either caused by th...
Training large neural networks is time-consuming. To speed up the process, distributed training is o...