Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses 32× less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove tha...
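The majority-vote scheme described in this abstract can be illustrated with a short sketch. The NumPy code below is a minimal illustration under stated assumptions, not the authors' implementation: the toy quadratic objective, the worker count, the noise scale, and the fixed step size are all arbitrary choices made to keep the example self-contained. It shows the two ideas the abstract names: each worker sends only the coordinate-wise sign of its stochastic gradient (1 bit per coordinate instead of a 32-bit float, hence the 32× communication saving), and the server aggregates by majority vote before broadcasting a signed update.

```python
import numpy as np

# Minimal sketch of signSGD with majority vote.
# Illustrative assumptions: toy quadratic objective, simulated workers,
# Gaussian gradient noise, fixed step size. Not the paper's implementation.

rng = np.random.default_rng(0)
dim, num_workers, lr, steps = 10, 7, 0.05, 200
x_star = rng.normal(size=dim)   # optimum of the toy objective (assumed)
x = np.zeros(dim)               # shared model parameters

def stochastic_gradient(x):
    """Gradient of 0.5 * ||x - x_star||^2 plus Gaussian noise (one worker's mini-batch)."""
    return (x - x_star) + rng.normal(scale=1.0, size=x.shape)

for step in range(steps):
    # Each worker transmits only the sign of its stochastic gradient (1 bit per coordinate).
    worker_signs = np.stack(
        [np.sign(stochastic_gradient(x)) for _ in range(num_workers)]
    )
    # The server takes a coordinate-wise majority vote and broadcasts the signed update.
    vote = np.sign(worker_signs.sum(axis=0))
    x -= lr * vote

print("final distance to optimum:", np.linalg.norm(x - x_star))
```

Because the server only needs the sign of the summed votes, a single corrupted or faulty worker can flip a coordinate's update only when the remaining workers are nearly evenly split, which is the intuition behind the robustness claim in the abstract.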
Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training...
Stochastic optimization algorithms implemented on distributed computing architectures are increasing...
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep network...
Training large neural networks requires distributing learning across multiple workers, where the cos...
For many data-intensive real-world applications, such as recognizing objects from images, detecting ...
Whether it occurs in artificial or biological substrates, learning is a distributed phenomen...
We present AGGREGATHOR, a framework that implements state-of-the-art robust (Byzantine-resilient) di...
In modern-day machine learning applications such as self-driving cars, recommender systems, robotics...
We consider distributed optimization under communication constraints for training deep learning mode...
When gradient descent (GD) is scaled to many parallel computing servers (workers) for large scale ma...
While machine learning is going through an era of celebrated success, concerns have been raised abou...
In distributed training of deep neural networks or Federated Learning (FL), practitioners typically run Stoch...
Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique t...
Load imbalance is pervasive in distributed deep learning training systems, caused either by th...
Many areas of deep learning benefit from using increasingly larger neural networks trained on public...