A matrix-free preconditioner and a low-rank approximation preconditioner are proposed to accelerate the convergence of stochastic gradient descent (SGD) by exploiting curvature information sampled from Hessian-vector products or from finite differences of parameters and gradients, similar to the BFGS algorithm. Both preconditioners are fitted in an online manner by minimizing a criterion that is free of line search and robust to stochastic gradient noise, and they are further constrained to lie on certain connected Lie groups to preserve their corresponding symmetry or invariance, e.g., orientation of coordinates by the connected general linear group with positive determinants. The Lie group's equivariance property facilitates preconditioner fitting, and its inva...
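The abstract describes the fitting step only at a high level, so the following is a minimal NumPy sketch of one plausible realization, not the paper's own implementation. It assumes the fitting criterion E[hᵀPh + vᵀP⁻¹v] with the preconditioner factored as P = QᵀQ, and Q constrained to the Lie group of upper-triangular matrices with positive diagonal; the names fit_Q and precondition and the probe pair (v, h ≈ Hv) are illustrative assumptions.

```python
# A hedged sketch of one online Lie-group preconditioner update, assuming
# the criterion E[h^T P h + v^T P^{-1} v] with P = Q^T Q and Q restricted
# to upper-triangular matrices with positive diagonal.
import numpy as np

def fit_Q(Q, v, h, lr=0.01):
    """One fitting step, given a probe vector v paired with h ~ H v
    (a Hessian-vector product or a finite difference of gradients)."""
    a = Q @ h                                    # a = Q h
    b = np.linalg.solve(Q.T, v)                  # b = Q^{-T} v
    # Relative gradient of the criterion on the triangular group.
    grad = np.triu(np.outer(a, a) - np.outer(b, b))
    # Normalizing by the spectral norm gives a line-search-free step.
    step = lr / (np.linalg.norm(grad, 2) + 1e-12)
    # An upper-triangular multiplicative update keeps Q in the group;
    # for small steps the diagonal of Q stays positive.
    return Q - step * (grad @ Q)

def precondition(Q, g):
    """Preconditioned gradient P g = Q^T Q g, used in place of g in SGD."""
    return Q.T @ (Q @ g)
```

In a training loop, fit_Q would be called once per iteration with a fresh probe pair before applying precondition to the stochastic gradient; the multiplicative form of the update is what lets the iterate stay on the group without projections or line searches.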
Is Stochastic Gradient Descent (SGD) substantially different from Glauber dynamics? This is a fundam...
The goal of this paper is to debunk and dispel the magic behind black-box optimizers and stochastic ...
Current machine learning practice requires solving huge-scale empirical risk minimization problems q...
Despite the recent growth of theoretical studies and empirical successes of neural networks, gradien...
Gaussian process hyperparameter optimization requires linear solves with, and log-determinants of, l...
Stochastic Gradient Descent (SGD) is an out-of-equilibrium algorithm used extensively to train artif...
This paper introduces PROMISE ($\textbf{Pr}$econditioned Stochastic $\textbf{O}$ptimization $\textbf...
Stochastic Gradient Descent (SGD) has played a crucial role in the success of modern machine learnin...
Stochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-ma...
Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorith...
The stochastic gradient descent (SGD) algorithm has been widely used in statistical estimation for l...
We propose a new per-layer adaptive step-size procedure for stochastic first-order optimization meth...
Stochastic gradient descent (SGD) stands as a classical method for building large-scale machine learning ...
Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods in the...
The conjugate gradient method (CG) is usually used with a preconditioner which i...