Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock to reducing training time by increasing the batch size to a substantial fraction of the training data is the persistent degradation in generalization performance (the generalization gap). To address this issue, recent work proposes adding small perturbations to the model parameters when computing the stochastic gradients and reports improved generalization performance due to smoothing effects. However, this approach is poorly understood; it often requires model-specific noise and fine-tuning. To alleviate these drawbacks, we propo...
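As a rough illustration of the perturbation idea described above, the following is a minimal PyTorch sketch of a single large-batch SGD step in which Gaussian noise is added to the model parameters before the stochastic gradient is computed and removed again before the update. The function name `perturbed_sgd_step`, the noise scale `sigma`, and the plain SGD update rule are illustrative assumptions, not the cited work's exact procedure.

```python
import torch

def perturbed_sgd_step(model, loss_fn, batch, lr=0.1, sigma=1e-3):
    # Sketch of one perturbed-parameter SGD step (assumed hyperparameters):
    # the gradient is evaluated at noisy weights, which acts as a smoothing
    # of the loss landscape, then a plain SGD update is applied.
    inputs, targets = batch

    # Draw and apply a small Gaussian perturbation to every parameter.
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            noise = sigma * torch.randn_like(p)
            p.add_(noise)
            noises.append(noise)

    # Compute the stochastic gradient at the perturbed parameters.
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Remove the perturbation, then take an SGD step using the gradient
    # that was evaluated at the perturbed point.
    with torch.no_grad():
        for p, noise in zip(model.parameters(), noises):
            p.sub_(noise)
            p.sub_(lr * p.grad)

    return loss.item()


if __name__ == "__main__":
    # Toy usage on a synthetic regression problem (illustrative only).
    torch.manual_seed(0)
    model = torch.nn.Linear(10, 1)
    x, y = torch.randn(256, 10), torch.randn(256, 1)
    for step in range(5):
        loss = perturbed_sgd_step(
            model, torch.nn.functional.mse_loss, (x, y), lr=0.05, sigma=1e-3
        )
        print(f"step {step}: loss {loss:.4f}")
```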
Regularized nonlinear acceleration (RNA) is a generic extrapolation scheme for optimization methods,...
Stochastic Gradient Descent algorithms (SGD) remain a popular optimizer for deep learning networks a...
[previously titled "Theory of Deep Learning III: Generalization Properties of SGD"] In Theory III we...
Large-batch SGD is important for scaling training of deep neural networks. However, without fine-tun...
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep n...
We present a comprehensive framework of search methods, such as simulated annealing and batch traini...
Deep neural networks have become the state-of-the-art tool to solve many computer vision problems. H...
State-of-the-art training algorithms for deep learning models are based on stochastic gradient desce...
In modern supervised learning, many deep neural networks are able to interpolate the data: the empir...
Optimization is the key component of deep learning. Increasing depth, which is vital for reaching a...
Distributed training of massive machine learning models, in particular deep neural networks, via Sto...
The remarkable practical success of deep learning has revealed some major surprises from a theoretic...
The success of deep learning has shown impressive empirical breakthroughs, but many theoretical ques...
In stochastic gradient descent (SGD) and its variants, the optimized gradient estimators may be as e...