Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using $\sim$4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies gr...
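As a rough illustration of the conditional computation described above (not code from the paper), the sketch below shows a minimal top-2 gated Mixture-of-Experts feed-forward layer in which each token is routed to only a few of the available experts; all names here (MoELayer, d_model, n_experts, top_k) are assumptions made for this example.

```python
# Illustrative sketch only (not the paper's implementation): a minimal
# top-2 gated MoE feed-forward layer demonstrating conditional computation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: scores each token against every expert.
        self.gate = nn.Linear(d_model, n_experts)
        # Expert FFNs: only top_k of them run for any given token.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)        # (num_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # top_k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out


# Example: 16 tokens pass through the layer; each activates 2 of 8 experts.
y = MoELayer()(torch.randn(16, 512))
```

Because only top_k expert networks are evaluated per token, total parameter count can grow with the number of experts while per-token compute stays roughly constant, which is the efficiency trade-off the abstract compares against dense models.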
Mixture of experts (MoE) is a popular technique in deep learning that improves model capacity with c...
Despite their wide adoption, the underlying training and memorization dynamics of very large languag...
Computational language models (LMs), most notably exemplified by the widespread success of OpenAI's ...
As the training of giant dense models hits the boundary on the availability and capability of the ha...
Scaling language models with more data, compute and parameters has driven significant progress in na...
This paper reports on the benefits of large-scale statistical language modeling in machine translatio...
Large language models (LLMs) based on transformers have made significant strides in recent years, th...
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capabilit...
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnabl...
Modern language models leverage increasingly large numbers of parameters to achieve performance on n...
N-gram language models are an essential component in statistical natural language processing systems...
Language models demonstrate both quantitative improvement and new qualitative capabilities with incr...
All-MLP architectures have attracted increasing interest as an alternative to attention-based models...
This dissertation addresses two significant challenges of large language models (LLMs): robustness a...
Sparse mixture of expert architectures (MoEs) scale model capacity without large increases in traini...