Modern language models leverage increasingly large numbers of parameters to achieve strong performance on natural language understanding tasks. Ensembling these models in specific configurations for downstream tasks shows even further performance improvements. In this paper, we analyze bagging for language models, comparing single language models to bagged ensembles of roughly equivalent final model size. We explore an array of model bagging configurations for natural language understanding tasks, with final ensemble sizes ranging from 300M to 1.5B parameters, and determine that our ensembling methods are at best roughly equivalent to single LM baselines. We note other positive effects of bagging and pruning i...
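To make the setup concrete, the sketch below shows the standard bagging recipe that the abstract describes at a high level: each ensemble member is trained on its own bootstrap replicate of the task data, and at inference time the members' class probabilities are averaged before taking the argmax. Everything here (TinyClassifier, bootstrap_sample, the toy sizes) is hypothetical scaffolding, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyClassifier:
    """Stand-in for a fine-tuned LM classification head (hypothetical)."""
    def __init__(self, weights):
        self.weights = weights                 # (n_features, n_classes)

    def predict_proba(self, x):
        logits = x @ self.weights
        e = np.exp(logits - logits.max())      # numerically stable softmax
        return e / e.sum()

def bootstrap_sample(n, rng):
    """Indices for one bagging replicate (sampling with replacement).
    In a real bagging setup, each ensemble member is trained on its own
    replicate; training itself is omitted from this sketch."""
    return rng.integers(0, n, size=n)

def ensemble_predict(models, x):
    """Bagged prediction: average class probabilities, then argmax."""
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return int(np.argmax(probs))

# Toy usage: three 'models' with different weights, one 4-dim input.
models = [TinyClassifier(rng.normal(size=(4, 3))) for _ in range(3)]
print(ensemble_predict(models, rng.normal(size=4)))
```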
We first present skip n-grams interpolated with various other n-grams and measure their ability to i...
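As a rough illustration of the interpolation idea in the line above, the following sketch mixes a maximum-likelihood skip-bigram estimate (conditioning on the word two positions back) with a standard bigram estimate. The corpus, the counts, and the weight lam=0.7 are invented for the example, and edge effects in the denominators are ignored.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat lay on the mat".split()

# Standard bigram counts pair (w_{i-1}, w_i); skip-bigram counts skip
# one position and pair (w_{i-2}, w_i). Both are simple MLE estimates.
bigram = Counter(zip(tokens, tokens[1:]))
skip = Counter(zip(tokens, tokens[2:]))
unigram = Counter(tokens)

def p_bigram(prev, w):
    return bigram[(prev, w)] / unigram[prev]

def p_skip(prev2, w):
    return skip[(prev2, w)] / unigram[prev2]

def p_interp(prev2, prev, w, lam=0.7):
    """Linear interpolation of bigram and skip-bigram estimates."""
    return lam * p_bigram(prev, w) + (1 - lam) * p_skip(prev2, w)

print(p_interp("on", "the", "mat"))   # 0.7 * 0.5 + 0.3 * 1.0 = 0.65
```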
Large multilingual language models typically rely on a single vocabulary shared across 100+ language...
Language modeling has been widely used in natural language processing applications, and there...
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown e...
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional com...
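The conditional computation that MoE layers rely on can be sketched in a few lines: a gating network scores the experts for each token, and only the top-k experts are actually evaluated. The sizes, the top-1 routing, and the linear "experts" below are illustrative assumptions, not any particular system's design.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 1          # illustrative sizes, top-1 routing

W_gate = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy FFNs

def moe_layer(x):
    """Route a token to its top-k experts; skip the rest entirely."""
    scores = x @ W_gate                    # (n_experts,) gating logits
    top = np.argsort(scores)[-k:]          # indices of selected experts
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()                   # renormalized gate weights
    # Only the selected experts run: this is the conditional computation.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

print(moe_layer(rng.normal(size=d)).shape)   # (8,)
```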
Scaling existing applications and solutions to multiple human languages has traditionally proven to ...
Scaling language models with more data, compute and parameters has driven significant progress in na...
Thesis (Ph.D.), University of Washington, 2023. Language models (LMs) are at the core of almost all st...
How cross-linguistically applicable are NLP models, specifically language models? A fair comparison ...
Multilingual models are often particularly dependent on scaling to generalize to a growing number of...
Multilingual language models are widely used to extend NLP systems to low-resource languages. Howeve...
Language models demonstrate both quantitative improvement and new qualitative capabilities with incr...
Real-world business applications require a trade-off between language model performance and size. We...
Despite their wide adoption, the underlying training and memorization dynamics of very large languag...
State-of-the-art natural language processing models are anything but compact. Syntactic parsers have...