Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale model capacity while incurring substantially less training cost than dense variants. The largest GLaM has 1.2 trillion parameters, approximately 7x larger than GPT-3, yet it consumes only 1/3 of the energy used to train GPT-3.
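To make the sparsely activated design concrete, the following is a minimal NumPy sketch of a top-2 gated mixture-of-experts feed-forward layer, the general pattern the abstract describes. The function name, dimensions, expert count, and routing details (raw softmax gate weights, no capacity limits or load balancing) are illustrative assumptions for this sketch, not GLaM's actual implementation.

import numpy as np

def moe_layer(x, gate_w, expert_w1, expert_w2, top_k=2):
    """Sparsely activated feed-forward layer: each token is routed to its
    top_k experts only, so compute per token stays roughly constant even
    as the number of experts (and hence total parameters) grows.

    x:         (tokens, d_model)          token representations
    gate_w:    (d_model, n_experts)       router weights
    expert_w1: (n_experts, d_model, d_ff) first expert projection
    expert_w2: (n_experts, d_ff, d_model) second expert projection
    """
    logits = x @ gate_w                                # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)              # softmax over experts
    top = np.argsort(-probs, axis=-1)[:, :top_k]       # top_k expert ids per token

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:                               # only top_k experts run per token
            h = np.maximum(x[t] @ expert_w1[e], 0.0)   # expert FFN with ReLU
            out[t] += probs[t, e] * (h @ expert_w2[e]) # gate-weighted expert output
    return out

# Toy usage: 4 tokens, model width 8, 16 experts, FFN width 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = moe_layer(x,
              gate_w=rng.normal(size=(8, 16)),
              expert_w1=rng.normal(size=(16, 8, 32)),
              expert_w2=rng.normal(size=(16, 32, 8)))
print(y.shape)  # (4, 8)

Because only two experts process each token, adding more experts increases total parameter count without a proportional increase in per-token FLOPs, which is the source of the training-cost savings the abstract highlights.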
Thesis (Ph.D.)--University of Washington, 2023. Language models (LMs) are at the core of almost all st...
Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these ...
Language models demonstrate both quantitative improvement and new qualitative capabilities with incr...
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown e...
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional com...
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural La...
The recent advance of self-supervised learning associated with the Transformer architecture enables ...
As the training of giant dense models hits the boundary on the availability and capability of the ha...
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., makin...
Large language models (LLMs) are a special class of pretrained language models obtained by scaling m...
Deploying large language models (LLMs) is challenging because they are memory inefficient and comput...
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnabl...
With the dramatically increased number of parameters in language models, sparsity methods have recei...
The recent surge of generative AI has been fueled by the generative power of diffusion probabilistic...
The crystallization of modeling methods around the Transformer architecture has been a boon for prac...