Deploying large language models (LLMs) is challenging because they are memory-inefficient and compute-intensive for practical applications. In response, researchers train smaller task-specific models, either by finetuning with human labels or by distilling with LLM-generated labels. However, both finetuning and distillation require large amounts of training data to match LLM performance. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4...
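The multi-task setup this abstract describes, training one small model to both predict the task label and reproduce the LLM-generated rationale, can be sketched roughly as follows. This is a minimal illustration only: it assumes a T5-style seq2seq student, simple "[label]"/"[rationale]" task prefixes, and a weighted sum of the two losses with an assumed weight LAMBDA; none of these details are taken from the paper's exact recipe.

```python
# Minimal sketch of multi-task training with an LLM rationale as extra supervision.
# Model size, task prefixes, and LAMBDA are illustrative assumptions.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

LAMBDA = 0.5  # assumed weight on the rationale-generation task

# One toy training example: input, gold label, and an LLM-generated rationale.
question = "If there are 3 cars and each car has 4 wheels, how many wheels are there?"
label = "12"
rationale = "Each of the 3 cars has 4 wheels, so 3 * 4 = 12 wheels."

def seq2seq_loss(prefix, source, target):
    """Conditional-generation loss for one (source, target) pair under a task prefix."""
    enc = tokenizer(prefix + source, return_tensors="pt", truncation=True)
    tgt = tokenizer(target, return_tensors="pt", truncation=True)
    out = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask,
                labels=tgt.input_ids)
    return out.loss

model.train()
# Task 1: predict the label. Task 2: generate the rationale.
# Both tasks share the same small model; their losses are combined into one objective.
loss = seq2seq_loss("[label] ", question, label) \
     + LAMBDA * seq2seq_loss("[rationale] ", question, rationale)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

At inference time only the label task would be used, so the rationale head adds no deployment cost in this kind of setup.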
Large language models (LLMs) have shown incredible performance in completing various real-world task...
Prior work shows that it is possible to expand pretrained Masked Language Models (MLMs) to new langu...
The recent surge of generative AI has been fueled by the generative power of diffusion probabilistic...
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown e...
Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translati...
Scaling language models with more data, compute and parameters has driven significant progress in na...
We present a new method, LiST (short for Lite Prompted Self-Training), for parameter-efficient fine-t...
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural La...
Pretrained large language models (LLMs) are widely used in many sub-fields of natural language proce...
Large language models (LLMs) have achieved remarkable advancements in the field of natural language ...
Pretrained language models have become the standard approach for many NLP tasks due to strong perfor...
Thesis (Ph.D.), University of Washington, 2023. Language models (LMs) are at the core of almost all st...
Pretrained large language models (LLMs) are strong in-context learners that are able to perform few-...
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnabl...
Language models (LMs) with less than 100B parameters are known to perform poorly on chain-of-thought...