Batch gradient descent, Δw(t) = −η dE/dw(t), converges to a minimum of quadratic form with a time constant no better than ¼ λmax/λmin, where λmin and λmax are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term, Δw(t) = −η dE/dw(t) + α Δw(t−1), improves this to ≈ √(λmax/λmin), although only in the batch case. Here we show that second-order momentum, Δw(t) = −η dE/dw(t) + α Δw(t−1) + β Δw(t−2), can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a non-quadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics.
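As a concrete illustration of the update rules quoted above, the following minimal Python sketch compares plain batch gradient descent with the first-order momentum rule on a two-dimensional quadratic. The Hessian eigenvalues, step sizes, and momentum coefficient are illustrative assumptions (set near the textbook optima for a quadratic), not values taken from the paper.

# Illustrative sketch: plain vs. momentum gradient descent on a 2-D quadratic
# E(w) = 0.5 * w^T H w, whose Hessian H has eigenvalues lam_min and lam_max.
import numpy as np

lam_min, lam_max = 1.0, 100.0          # assumed eigenvalues of the Hessian
H = np.diag([lam_min, lam_max])

def descend(eta, alpha=0.0, steps=500):
    """Run Delta w(t) = -eta * dE/dw(t) + alpha * Delta w(t-1)."""
    w = np.array([1.0, 1.0])
    dw = np.zeros_like(w)
    for _ in range(steps):
        grad = H @ w                   # dE/dw for the quadratic
        dw = -eta * grad + alpha * dw
        w = w + dw
    return np.linalg.norm(w)           # distance from the minimum at w = 0

# Step sizes chosen near the standard optima for each rule (assumption).
print("plain GD      :", descend(eta=2.0 / (lam_min + lam_max)))
print("with momentum :", descend(
    eta=4.0 / (np.sqrt(lam_min) + np.sqrt(lam_max)) ** 2,
    alpha=((np.sqrt(lam_max) - np.sqrt(lam_min)) /
           (np.sqrt(lam_max) + np.sqrt(lam_min))) ** 2))

With these settings the momentum run drives the distance to the minimum many orders of magnitude lower than plain gradient descent in the same number of steps, consistent with the λmax/λmin versus √(λmax/λmin) time constants described above.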
Momentum methods have been shown to accelerate the convergence of the standard gradient descent algo...
Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods in the...
We derive two-point step sizes for the steepest-descent method by approximating the secant equation....
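For reference, the two-point step size referred to here is the Barzilai-Borwein rule obtained from the secant equation, alpha_k = (sᵀs)/(sᵀy) with s = w_k − w_{k−1} and y = g_k − g_{k−1}. The sketch below applies it to an assumed test quadratic; the objective, starting point, and iteration count are illustrative assumptions rather than the paper's experiments.

# Illustrative sketch (assumed quadratic objective and starting point) of the
# two-point step size alpha_k = (s^T s) / (s^T y), used in place of a line
# search for steepest descent.
import numpy as np

H = np.diag([1.0, 10.0, 100.0])        # assumed Hessian of a test quadratic

def grad(w):
    return H @ w

w_prev = np.array([1.0, 1.0, 1.0])
w = w_prev - 1e-3 * grad(w_prev)       # one small initial step to get two points
for _ in range(50):
    s = w - w_prev
    y = grad(w) - grad(w_prev)
    if np.linalg.norm(s) < 1e-12:      # already converged; avoid a 0/0 step
        break
    alpha = (s @ s) / (s @ y)          # BB step size from the secant equation
    w_prev, w = w, w - alpha * grad(w)

print("final |grad| =", np.linalg.norm(grad(w)))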
There are a number of algorithms that can be categorized as gradient based. One such algorithm is th...
It might be inadequate for the line search technique for Newton's method to use only one floating po...
A momentum term is usually included in the simulations of connectionist learning algorithms. Althoug...
During the last few decades, several papers were published about second-order optimizatio...
We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for q...
This work shows that applying Gradient Descent (GD) with a fixed step size to minimize a (possibly n...
The study of first-order optimization is sensitive to the assumptions made on the objective function...
In recent years, it has become increasingly clear that the critical issue in gradient methods is the...