Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the...
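To make the sparse-MoE idea concrete, here is a minimal sketch of an MoE feed-forward layer with a simple top-2 softmax router. The sizes, routing scheme, and class name (`SparseMoEFFN`) are illustrative assumptions, not the paper's implementation; the point is that total parameters grow with the number of experts while each token activates only a few of them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    """Minimal sparse MoE feed-forward layer: many experts, top-k routing.

    Total parameter count grows with num_experts, but each token is processed
    by only top_k experts, so per-token compute stays roughly constant.
    Sizes and routing are illustrative, not taken from the paper.
    """

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, d_model)
        scores = self.router(x)                # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # dispatch each token to its k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Adding experts multiplies the feed-forward parameter budget, yet inference cost per token depends only on `top_k`, which is the property the abstract relies on when combining MoE with instruction tuning.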
We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset ...
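As a rough illustration of the bias-only idea, the sketch below freezes every parameter except those whose name ends in ".bias". The name-matching heuristic and the helper name are assumptions for clarity, not BitFit's exact recipe (the paper also considers tuning only a subset of the biases).

```python
import torch
import torch.nn as nn

def apply_bias_only_finetuning(model: nn.Module) -> int:
    """Freeze all parameters except bias terms (BitFit-style sketch).

    Matching parameter names that end in ".bias" is a simplifying assumption;
    returns the number of parameters left trainable.
    """
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith(".bias")
        if param.requires_grad:
            trainable += param.numel()
    return trainable

# Usage sketch: only bias parameters receive gradients during fine-tuning.
# n_trainable = apply_bias_only_finetuning(model)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```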
Parameter-efficient fine-tuning (PEFT) methods can adapt large language models to downstream tasks b...
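For context, here is a minimal sketch of one widely used PEFT technique, a low-rank (LoRA-style) adapter wrapped around a frozen linear layer. This is a generic illustration of the parameter-efficient idea, not necessarily the specific method this abstract studies; the rank, scaling, and class name are assumptions.

```python
import math
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (LoRA-style).

    Only A and B (rank r) are trained, so the tunable parameter count is a
    small fraction of the base layer's. Rank and scaling are illustrative.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # keep the pretrained weights frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)
```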
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capabilit...
The Mixture of Experts (MoE) is a widely known neural architecture where an ensemble of specialized ...
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks de...
As the training of giant dense models hits the boundary on the availability and capability of the ha...
Scaling language models with more data, compute and parameters has driven significant progress in na...
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural La...
In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension a...
Machine learning models based on the aggregated outputs of submodels, either at the activation or pr...
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many pri...
Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a...
High-quality instruction-tuning data is critical to improving LLM capabilities. Existing data collec...
Recently, Instruction fine-tuning has risen to prominence as a potential method for enhancing the ze...
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional com...