Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock to reducing training time by increasing the batch size to a substantial fraction of the training data is the persistent degradation in generalization performance (the generalization gap). To address this issue, recent work proposes adding small perturbations to the model parameters when computing the stochastic gradients and reports improved generalization performance due to smoothing effects. However, this approach is poorly understood; it often requires model-specific noise and fine-tuning. To alleviate these drawbacks, we propo...
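As a rough illustration of the perturbation idea described above, the following is a minimal PyTorch sketch of a single large-batch SGD step in which Gaussian noise is added to the model parameters before the stochastic gradient is computed and removed again before the update. The function name `perturbed_sgd_step`, the noise scale `sigma`, and the plain SGD update rule are illustrative assumptions, not the cited work's exact procedure.

```python
import torch

def perturbed_sgd_step(model, loss_fn, batch, lr=0.1, sigma=1e-3):
    # Sketch of one perturbed-parameter SGD step (assumed hyperparameters):
    # the gradient is evaluated at noisy weights, which acts as a smoothing
    # of the loss landscape, then a plain SGD update is applied.
    inputs, targets = batch

    # Draw and apply a small Gaussian perturbation to every parameter.
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            noise = sigma * torch.randn_like(p)
            p.add_(noise)
            noises.append(noise)

    # Compute the stochastic gradient at the perturbed parameters.
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Remove the perturbation, then take an SGD step using the gradient
    # that was evaluated at the perturbed point.
    with torch.no_grad():
        for p, noise in zip(model.parameters(), noises):
            p.sub_(noise)
            p.sub_(lr * p.grad)

    return loss.item()


if __name__ == "__main__":
    # Toy usage on a synthetic regression problem (illustrative only).
    torch.manual_seed(0)
    model = torch.nn.Linear(10, 1)
    x, y = torch.randn(256, 10), torch.randn(256, 1)
    for step in range(5):
        loss = perturbed_sgd_step(
            model, torch.nn.functional.mse_loss, (x, y), lr=0.05, sigma=1e-3
        )
        print(f"step {step}: loss {loss:.4f}")
```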
Regularized nonlinear acceleration (RNA) is a generic extrapolation scheme for optimization methods,...
Stochastic Gradient Descent algorithms (SGD) remain a popular optimizer for deep learning networks a...
[previously titled "Theory of Deep Learning III: Generalization Properties of SGD"] In Theory III we...
Large-batch SGD is important for scaling training of deep neural networks. However, without fine-tun...
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep n...
We present a comprehensive framework of search methods, such as simulated annealing and batch traini...
Deep neural networks have become the state-of-the-art tool to solve many computer vision problems. H...
State-of-the-art training algorithms for deep learning models are based on stochastic gradient desce...
In modern supervised learning, many deep neural networks are able to interpolate the data: the empir...
Optimization is the key component of deep learning. Increasing depth, which is vital for reaching a...
Distributed training of massive machine learning models, in particular deep neural networks, via Sto...
The remarkable practical success of deep learning has revealed some major surprises from a theoretic...
The success of deep learning has shown impressive empirical breakthroughs, but many theoretical ques...
In stochastic gradient descent (SGD) and its variants, the optimized gradient estimators may be as e...