Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we inves...
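To make the setup described above concrete, below is a minimal sketch of masked-language-model pretraining from scratch under a fixed wall-clock budget on a single GPU. The model size (a default BERT configuration), the reuse of the bert-base-uncased tokenizer, the wikitext-103 corpus, the hyperparameters, and the simple time-based stopping rule are all illustrative assumptions, not the pipeline studied in the paper.

```python
# Sketch: single-GPU masked-language-model pretraining with a one-day wall-clock budget.
# All concrete choices (dataset, tokenizer, sequence length, batch size, learning rate)
# are assumptions for illustration only.
import time
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

device = "cuda"
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Load a public corpus, drop empty lines, and tokenize into short sequences.
raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
raw = raw.filter(lambda example: len(example["text"].strip()) > 0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128,
                     return_special_tokens_mask=True)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# The collator pads each batch and applies 15% random token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
loader = DataLoader(tokenized, batch_size=128, shuffle=True, collate_fn=collator)

# A BERT-sized encoder initialized from scratch (no pretrained weights).
model = BertForMaskedLM(BertConfig()).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

budget_seconds = 24 * 3600  # the one-day wall-clock budget
start = time.time()
model.train()
for batch in loader:
    if time.time() - start > budget_seconds:
        break  # stop when the compute budget is exhausted
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In this budget-limited regime the binding constraint is wall-clock time rather than dataset size, which is why the loop above terminates on elapsed time instead of on a fixed number of epochs.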
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many pri...
Transformer-based masked language models trained on general corpora, such as BERT and RoBERTa, have ...
Distilling state-of-the-art transformer models into lightweight student models is an effective way t...
The crystallization of modeling methods around the Transformer architecture has been a boon for prac...
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown e...
Scaling language models with more data, compute and parameters has driven significant progress in na...
In recent years, the number of parameters in a single deep learning (DL) model has been growing much fast...
Transformer-based neural models are used in many AI applications. Training these models is expensive...
The computation necessary for training Transformer-based language models has skyrocketed in recent y...
One of the current major research trends is the evolution of heterogeneous parallel comp...
Multi-task learning (MTL), instruction tuning, and prompting have recently been shown to improve the...
Recent works have demonstrated great success in pre-training large-scale autoregressive language mod...
Deep learning models are trained on servers with many GPUs, and training must scale with the number o...
The pre-trained model (PTM) is revolutionizing Artificial Intelligence (AI) technology. However, the...
Deep learning's recent history has been one of achievement: from triumphing over humans in the game ...