We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function. We further show that the trained weights, as a function of the layer index, admit a scaling limit which is Hölder continuous as the depth of the network tends to infinity. The proofs are based on non-asymptotic estimates of the loss function and of norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
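To make the setting concrete, the following is a minimal sketch, not the authors' experimental setup, of the kind of model described above: a residual network with constant layer width, a smooth (tanh) activation, and a 1/depth residual scaling, trained by plain full-batch gradient descent on a squared loss. The width, depth, scaling, learning rate, and synthetic data below are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Illustrative choices (assumptions, not taken from the paper):
    # width d, depth L, n synthetic training samples, 1/L residual scaling.
    torch.manual_seed(0)
    d, L, n = 16, 64, 128
    X = torch.randn(n, d)   # inputs
    y = torch.randn(n, 1)   # targets

    class ResNet(nn.Module):
        def __init__(self, d, L):
            super().__init__()
            # constant layer width d across all L residual blocks
            self.layers = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(L))
            self.readout = nn.Linear(d, 1, bias=False)
            self.L = L

        def forward(self, x):
            for layer in self.layers:
                # residual update with a smooth activation
                x = x + torch.tanh(layer(x)) / self.L
            return self.readout(x)

    model = ResNet(d, L)
    # full-batch gradient descent (no stochasticity, no regularization)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(500):
        opt.zero_grad()
        loss = ((model(X) - y) ** 2).mean()
        loss.backward()
        opt.step()
        if step % 100 == 0:
            print(f"step {step:4d}  loss {loss.item():.3e}")

Under the convergence result stated above, the training loss in such a setting is expected to decrease at a linear (geometric) rate, and the trained weights, viewed as a function of the layer index k/L, can be inspected for regularity as L grows.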