We propose a computationally friendly adaptive learning rate schedule, "AdaLoss", which directly uses information from the loss function to adjust the step size in gradient descent methods. We prove that this schedule enjoys linear convergence for linear regression. Moreover, we extend the analysis to the non-convex regime, in the context of two-layer over-parameterized neural networks: if the width is sufficiently large (polynomially), AdaLoss converges robustly to the global minimum in polynomial time. We verify the theoretical results numerically and broaden the scope of the experiments with applications to LSTM models for text classification and to policy gradient methods for control problems.
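To make the idea of a loss-based schedule concrete, the following is a minimal sketch on a least-squares problem, assuming an accumulator of the form b_{t+1}^2 = b_t^2 + c * L(theta_t) and a step size eta / b_{t+1}; the constants eta, c, b0 and the function name adaloss_gd are illustrative placeholders, not the paper's exact formulation or tuned values.

    import numpy as np

    def adaloss_gd(X, y, eta=1.0, c=1.0, b0=1.0, n_iters=500):
        """Gradient descent with a loss-adaptive step size (illustrative sketch)."""
        n, d = X.shape
        theta = np.zeros(d)
        b_sq = b0 ** 2
        for _ in range(n_iters):
            residual = X @ theta - y
            loss = 0.5 * np.sum(residual ** 2) / n   # least-squares loss
            grad = X.T @ residual / n                # gradient of the loss
            b_sq += c * loss                         # accumulate loss information
            theta -= (eta / np.sqrt(b_sq)) * grad    # loss-adaptive step
        return theta

    # toy usage on a synthetic noiseless linear regression
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    theta_star = rng.standard_normal(10)
    y = X @ theta_star
    theta_hat = adaloss_gd(X, y)
    print(np.linalg.norm(theta_hat - theta_star))

Because the accumulator grows only while the loss is large, the schedule takes aggressive steps far from the minimum and automatically damps the step size as the loss decays, which is the mechanism behind the linear-convergence guarantee stated above.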