In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises a fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture to make it scale invariant, i.e. the scale of the parameters does not affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm at $\sqrt{\tfrac{2\lambda}{\eta}}$ times the weight norm, where $\eta$ is the learning rate and $\lambda$ is the weight decay coefficient.
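A minimal sketch of the optional clipping step (3), assuming PyTorch-style tensors and a generic parameter list; the helper name `clip_grad_by_weight_norm` and its structure are illustrative, not the paper's reference implementation:

```python
import math
import torch

def clip_grad_by_weight_norm(params, lr, weight_decay):
    """Cap the global gradient norm at sqrt(2 * weight_decay / lr) times the
    global weight norm before the SGD update (illustrative sketch only)."""
    params = [p for p in params if p.grad is not None]
    # Global L2 norm over all parameters, treated as one flattened vector.
    weight_norm = torch.norm(torch.stack([p.detach().norm(2) for p in params]), 2)
    max_norm = math.sqrt(2.0 * weight_decay / lr) * float(weight_norm)
    # In-place rescaling so the global gradient norm does not exceed max_norm.
    torch.nn.utils.clip_grad_norm_(params, max_norm)
```

In a training loop this would be called after `loss.backward()` and before `optimizer.step()`, with the optimizer being plain SGD using learning rate $\eta$ and weight decay $\lambda$.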
Optimization is the key component of deep learning. Increasing depth, which is vital for reaching a...
Training neural networks on large datasets can be accelerated by distributing the workload over a ne...
In modern supervised learning, many deep neural networks are able to interpolate the data: the empir...
Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks...
In this paper, we incorporate the Barzilai-Borwein step size into gradient descent methods used to t...
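For context, a minimal sketch of the classical Barzilai-Borwein step size that such methods build on, written for plain gradient descent; this is the textbook BB1 rule, not necessarily the exact variant used in the paper:

```python
import numpy as np

def bb_step_size(w, w_prev, g, g_prev, fallback=1e-3):
    """Classical BB1 step size: alpha = (s.s) / (s.y), where s = w - w_prev
    and y = g - g_prev are the differences of successive iterates and gradients."""
    s = w - w_prev
    y = g - g_prev
    denom = float(np.dot(s, y))
    if abs(denom) < 1e-12:  # guard against a degenerate curvature estimate
        return fallback
    # In nonconvex settings this value is often clipped to a positive range.
    return float(np.dot(s, s)) / denom

# Usage inside a gradient-descent loop (grad is a user-supplied gradient oracle):
#   alpha = bb_step_size(w, w_prev, g, g_prev)
#   w_prev, g_prev = w, g
#   w = w - alpha * g
#   g = grad(w)
```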
Over the past decade, deep neural networks have solved ever more complex tasks across many fronts in...
The architecture and the parameters of neural networks are often optimized independently, which requ...
We propose a new per-layer adaptive step-size procedure for stochastic first-order optimization meth...
Short version of https://arxiv.org/abs/1709.01427. When applied to training deep...
Over the decades, gradient descent has been applied to develop learning algorithms to train neural netw...
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep n...
The success of deep learning has shown impressive empirical breakthroughs, but many theoretical ques...
We discover restrained numerical instabilities in current training practices of deep networks with S...
This article presents a new criterion for convergence of gradient descent to a global minimum. The c...
Weight decay is a popular regularization technique for training of deep neural networks. Modern deep...
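As a generic illustration of the mechanism (not this paper's specific analysis), one SGD step with weight decay shrinks the weights toward zero in addition to taking the usual gradient step:

```python
import numpy as np

def sgd_weight_decay_step(w, grad, lr, weight_decay):
    """One SGD step with weight decay: shrink the weights by a factor of
    (1 - lr * weight_decay), then apply the gradient step."""
    return (1.0 - lr * weight_decay) * w - lr * grad
```

For plain SGD this is equivalent to adding an L2 penalty $\tfrac{\lambda}{2}\|w\|^2$ to the loss; for adaptive optimizers the two differ.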