In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises a fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture to make it scale invariant, i.e. the scale of the parameters does not affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm at $\sqrt{\tfrac{2\lambda}{\eta}}$ times the weight norm, where $\eta$ is the learning rate and $\lambda$ is the weight decay coefficient.
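A minimal sketch of the optional clipping step (3), assuming PyTorch-style tensors and a generic parameter list; the helper name `clip_grad_by_weight_norm` and its structure are illustrative, not the paper's reference implementation:

```python
import math
import torch

def clip_grad_by_weight_norm(params, lr, weight_decay):
    """Cap the global gradient norm at sqrt(2 * weight_decay / lr) times the
    global weight norm before the SGD update (illustrative sketch only)."""
    params = [p for p in params if p.grad is not None]
    # Global L2 norm over all parameters, treated as one flattened vector.
    weight_norm = torch.norm(torch.stack([p.detach().norm(2) for p in params]), 2)
    max_norm = math.sqrt(2.0 * weight_decay / lr) * float(weight_norm)
    # In-place rescaling so the global gradient norm does not exceed max_norm.
    torch.nn.utils.clip_grad_norm_(params, max_norm)
```

In a training loop this would be called after `loss.backward()` and before `optimizer.step()`, with the optimizer being plain SGD using learning rate $\eta$ and weight decay $\lambda$.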
Optimization is the key component of deep learning. Increasing depth, which is vital for reaching a...
Training neural networks on large datasets can be accelerated by distributing the workload over a ne...
In modern supervised learning, many deep neural networks are able to interpolate the data: the empir...
Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks...
In this paper, we incorporate the Barzilai-Borwein step size into gradient descent methods used to t...
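For context, a minimal sketch of the classical Barzilai-Borwein step size that such methods build on, written for plain gradient descent; this is the textbook BB1 rule, not necessarily the exact variant used in the paper:

```python
import numpy as np

def bb_step_size(w, w_prev, g, g_prev, fallback=1e-3):
    """Classical BB1 step size: alpha = (s.s) / (s.y), where s = w - w_prev
    and y = g - g_prev are the differences of successive iterates and gradients."""
    s = w - w_prev
    y = g - g_prev
    denom = float(np.dot(s, y))
    if abs(denom) < 1e-12:  # guard against a degenerate curvature estimate
        return fallback
    # In nonconvex settings this value is often clipped to a positive range.
    return float(np.dot(s, s)) / denom

# Usage inside a gradient-descent loop (grad is a user-supplied gradient oracle):
#   alpha = bb_step_size(w, w_prev, g, g_prev)
#   w_prev, g_prev = w, g
#   w = w - alpha * g
#   g = grad(w)
```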
Over the past decade, deep neural networks have solved ever more complex tasks across many fronts in...
The architecture and the parameters of neural networks are often optimized independently, which requ...
We propose a new per-layer adaptive step-size procedure for stochastic first-order optimization meth...
Short version of https://arxiv.org/abs/1709.01427. When applied to training deep...
Over the decades, gradient descent has been applied to develop learning algorithms to train neural netw...
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep n...
The success of deep learning has shown impressive empirical breakthroughs, but many theoretical ques...
We discover restrained numerical instabilities in current training practices of deep networks with S...
This article presents a new criterion for convergence of gradient descent to a global minimum. The c...
Weight decay is a popular regularization technique for training of deep neural networks. Modern deep...
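As a generic illustration of the mechanism (not this paper's specific analysis), one SGD step with weight decay shrinks the weights toward zero in addition to taking the usual gradient step:

```python
import numpy as np

def sgd_weight_decay_step(w, grad, lr, weight_decay):
    """One SGD step with weight decay: shrink the weights by a factor of
    (1 - lr * weight_decay), then apply the gradient step."""
    return (1.0 - lr * weight_decay) * w - lr * grad
```

For plain SGD this is equivalent to adding an L2 penalty $\tfrac{\lambda}{2}\|w\|^2$ to the loss; for adaptive optimizers the two differ.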