We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PRENORM) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose l2 normalization with a single scale parameter (SCALENORM) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FIXNORM). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT '15 English-Vietnamese. We observe sharper performance curves, more consistent...
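To make the SCALENORM idea above concrete, here is a minimal PyTorch sketch of l2 normalization with a single learned scale. The initialization g = sqrt(d_model) and the epsilon value are assumptions for illustration, not details taken from this snippet.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """Sketch of ScaleNorm: rescale each vector to a learned length g."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        # A single scalar gain for the whole layer, in place of
        # LayerNorm's per-dimension gain and bias (illustrative init).
        self.g = nn.Parameter(torch.tensor(float(d_model) ** 0.5))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # l2-normalize along the feature dimension, then scale by g.
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm
```

In a pre-norm residual block this would be applied before the sublayer, roughly x = x + sublayer(norm(x)), so the residual path itself stays unnormalized; FIXNORM corresponds to the same rescaling applied to word embeddings with a fixed rather than learned length.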
Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing...
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by ...
Thesis (Master's)--University of Washington, 2021. Transformer models perform well on NLP tasks, but r...
Self-supervised training methods for transformers have demonstrated remarkable performance across va...
Solid results from Transformers have made them prevailing architectures in various natural language ...
The general trend in NLP is towards increasing model capacity and performance via deeper neural netw...
The powerful modeling capabilities of all-attention-based transformer architectures often cause over...
Transformer-based models have brought a radical change to neural machine translation. A key feature ...
Machine translation has received significant attention in the field of natural language processing n...
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Text-to-Speech (TTS) normalization is an essential component of natural language processing (NLP) th...
Pre-trained transformers have rapidly become very popular in the Natural Language Processing (NLP) c...
We explore the suitability of self-attention models for character-level neural machine translation. ...
This article describes our experiments in neural machine translation using the recent Tensor2Tensor ...
Abstract We present weight normalization: a reparameterization of the weight vectors in a neural net...
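The weight-normalization snippet above describes reparameterizing each weight vector into a direction and a length. A minimal sketch for a linear layer is given below; the layer shapes and initializations are illustrative assumptions, and PyTorch also provides a built-in torch.nn.utils.weight_norm helper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightNormLinear(nn.Module):
    """Sketch of weight normalization: w = g * v / ||v|| per output row."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Direction parameters v, length parameters g, and bias b
        # (initialization values are illustrative).
        self.v = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.g = nn.Parameter(torch.ones(out_features))
        self.b = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recover the effective weight: length g times the unit direction of v.
        w = self.g.unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
        return F.linear(x, w, self.b)
```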