Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale model capacity while incurring substantially less training cost than dense variants. The largest GLaM has 1.2 trillion parameters, approximately 7x larger than GPT-3, yet it consumes only 1/3 of the energy used to train GPT-3.
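To make the sparsely activated design concrete, the following is a minimal NumPy sketch of a top-2 gated mixture-of-experts feed-forward layer, the general pattern the abstract describes. The function name, dimensions, expert count, and routing details (raw softmax gate weights, no capacity limits or load balancing) are illustrative assumptions for this sketch, not GLaM's actual implementation.

import numpy as np

def moe_layer(x, gate_w, expert_w1, expert_w2, top_k=2):
    """Sparsely activated feed-forward layer: each token is routed to its
    top_k experts only, so compute per token stays roughly constant even
    as the number of experts (and hence total parameters) grows.

    x:         (tokens, d_model)          token representations
    gate_w:    (d_model, n_experts)       router weights
    expert_w1: (n_experts, d_model, d_ff) first expert projection
    expert_w2: (n_experts, d_ff, d_model) second expert projection
    """
    logits = x @ gate_w                                # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)              # softmax over experts
    top = np.argsort(-probs, axis=-1)[:, :top_k]       # top_k expert ids per token

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:                               # only top_k experts run per token
            h = np.maximum(x[t] @ expert_w1[e], 0.0)   # expert FFN with ReLU
            out[t] += probs[t, e] * (h @ expert_w2[e]) # gate-weighted expert output
    return out

# Toy usage: 4 tokens, model width 8, 16 experts, FFN width 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = moe_layer(x,
              gate_w=rng.normal(size=(8, 16)),
              expert_w1=rng.normal(size=(16, 8, 32)),
              expert_w2=rng.normal(size=(16, 32, 8)))
print(y.shape)  # (4, 8)

Because only two experts process each token, adding more experts increases total parameter count without a proportional increase in per-token FLOPs, which is the source of the training-cost savings the abstract highlights.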
Thesis (Ph.D.)--University of Washington, 2023. Language models (LMs) are at the core of almost all st...
Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these ...
Language models demonstrate both quantitative improvement and new qualitative capabilities with incr...
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown e...
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional com...
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural La...
The recent advance of self-supervised learning associated with the Transformer architecture enables ...
As the training of giant dense models hits the boundary on the availability and capability of the ha...
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., makin...
Large language models (LLMs) are a special class of pretrained language models obtained by scaling m...
Deploying large language models (LLMs) is challenging because they are memory inefficient and comput...
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnabl...
With the dramatically increased number of parameters in language models, sparsity methods have recei...
The recent surge of generative AI has been fueled by the generative power of diffusion probabilistic...
The crystallization of modeling methods around the Transformer architecture has been a boon for prac...