This work derives an analytical expression for the global minima of a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the loss landscape of neural networks. Our result implies that zero is a special point in deep neural network architectures: weight decay interacts strongly with the model architecture and can create bad minima at zero in a network with more than $1$ hidden layer, a behavior qualitatively different from that of a network with only $1$ hidden layer. Practically, our result implies that common deep learning initialization methods are, in general, insufficient to ease the optimization of neural networks.
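To illustrate the depth dependence of this claim, consider a scalar toy instance (our own sketch, not the paper's general stochastic setting): fit a product of $L$ scalar weights to a unit target under weight decay $\lambda > 0$,

$$\ell(w_1,\dots,w_L) \;=\; \Big(\prod_{i=1}^{L} w_i - 1\Big)^2 \;+\; \lambda \sum_{i=1}^{L} w_i^2 .$$

For $L=1$, $\ell'(0) = -2 \neq 0$, so the origin is not even a critical point. For $L=2$ (one hidden layer), the gradient vanishes at the origin, but the Hessian there is $\begin{pmatrix} 2\lambda & -2 \\ -2 & 2\lambda \end{pmatrix}$, whose eigenvalue $2\lambda - 2$ is negative whenever $\lambda < 1$, so the origin is a saddle that gradient descent can escape. For $L \ge 3$ (more than one hidden layer), every off-diagonal second derivative contains a product of at least one other weight and vanishes at zero, so the Hessian reduces to $2\lambda I \succ 0$: the origin is a strict local minimum with loss $1$, a bad minimum created purely by the interaction of depth and weight decay.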
In this paper, we prove a conjecture published in 1989 and also partially address an open problem an...
This article presents a new criterion for convergence of gradient descent to a global minimum. The c...
The deep learning optimization community has observed how neural networks' generalization ability...
A main puzzle of deep networks revolves around the absence of overfitting despite overparametrizatio...
We study the optimization landscape of deep linear neural networks with the square loss. It is known...
Despite the widespread practical success of deep learning methods, our theoretical understanding of ...
Classical statistical learning theory implies that fitting too many parameters leads to overfitt...
Despite the widespread practical success of deep learning methods, our theoretical understanding of ...
While deep learning is successful in a num...
We develop new theoretical results on matrix perturbation to shed light on the impact of architectur...
Over the past decade, deep neural networks have solved ever more complex tasks across many fronts in...
This thesis characterizes the training process of deep neural networks. We are driven by two apparen...
Understanding the loss surface of neural networks is essential for the design of models with predict...
We inves...
The question of how and why the phenomenon of mode connectivity occurs in training deep neural netwo...