The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream performance faster than standard training. In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia). When pre-training BERT and T5 with a fixed computation budget using such methods, we find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate. We define an evaluation protocol that enables computation to...
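To make the batch-selection category concrete, here is a minimal sketch of a selective-backprop-style training step in PyTorch: forward the full batch without gradients, rank examples by per-example loss, and backpropagate only the highest-loss fraction. The function name, the `keep_frac` parameter, and the classification-style model interface are illustrative assumptions for this sketch, not the exact procedure evaluated in the paper.

```python
import torch
import torch.nn.functional as F

def selective_backprop_step(model, optimizer, inputs, labels, keep_frac=0.5):
    """One hypothetical training step that backpropagates only the
    highest-loss examples (a selective-backprop-style batch selection)."""
    # Gradient-free forward pass over the full batch to score each example.
    model.eval()
    with torch.no_grad():
        logits = model(inputs)
        per_example_loss = F.cross_entropy(logits, labels, reduction="none")

    # Keep the top `keep_frac` fraction of examples by loss (at least one).
    k = max(1, int(keep_frac * inputs.size(0)))
    top_idx = per_example_loss.topk(k).indices

    # Standard forward/backward pass restricted to the selected examples.
    model.train()
    optimizer.zero_grad()
    selected_logits = model(inputs[top_idx])
    loss = F.cross_entropy(selected_logits, labels[top_idx])
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under the paper's fixed-compute protocol, the cost of the extra scoring forward pass would count against any speedup from backpropagating fewer examples.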
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive ...
Transformer-based masked language models trained on general corpora, such as BERT and RoBERTa, have ...
Methods for improving the efficiency of deep network training (i.e. the resources required to achiev...
Transformer-based neural models are used in many AI applications. Training these models is expensive...
Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior ...
Recent trends in language modeling have focused on increasing performance through scaling, and have ...
Recently, the development of pre-trained language models has brought natural language processing (NL...
This document aims to be a self-contained, mathematically precise overview of transformer architectu...
Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DN...
There has been an explosion of interest in designing high-performance Transformers. While Transforme...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Large-scale transformer models have become the de-facto architectures for various machine learning a...
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressiv...
Transformer models are widely used in AI applications such as Natural Language Processing (NLP), Com...
Solid results from Transformers have made them prevailing architectures in various natural language ...