We introduce BitFit, a sparse fine-tuning method in which only the bias terms of the model (or a subset of them) are modified. We show that with small-to-medium training data, applying BitFit to pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, the method is competitive with other sparse fine-tuning methods. Beyond their practical utility, these findings bear on the question of understanding the commonly used process of fine-tuning: they support the hypothesis that fine-tuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge. Comment: Accepted at ACL 2022 main conference.
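To make the idea concrete, the following is a minimal sketch of BitFit-style training with Hugging Face Transformers and PyTorch: all pre-trained weights are frozen except parameters whose names contain "bias". The choice of checkpoint (bert-base-uncased), the two-label classification head, the decision to also unfreeze the randomly initialized classifier, and the learning rate are illustrative assumptions, not prescribed by the abstract above.

```python
# Sketch of BitFit-style sparse fine-tuning: train only bias terms.
# Model name, task head, and hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze every parameter except bias terms (and the newly added classifier head,
# which has no pre-trained weights to preserve).
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable} / {total} ({100 * trainable / total:.2f}%)")

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

The printed ratio makes the "sparse" claim tangible: bias terms account for well under one percent of BERT's parameters, yet they are the only ones updated during training in this setup.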
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural La...
Parameter-efficient fine-tuning methods (PEFTs) offer the promise of adapting large pre-trained mode...
Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a...
Language model fine-tuning is essential for modern natural language processing, but is computational...
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many pri...
Transformer-based pre-trained models with millions of parameters require large storage. Recent appro...
Existing fine-tuning methods either tune all parameters of the pre-trained model (full fine-tuning),...
In this paper, we move towards combining large parametric models with non-parametric prototypical ne...
Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream appro...
Large pre-trained language models have recently gained significant traction due to their improved pe...
There is growing interest in adapting large-scale language models using parameter-efficient fine-t...
Pre-training a language model and then fine-tuning it for downstream tasks has demonstrated state-of...
Gigantic pre-trained models have become central to natural language processing (NLP), serving as the...
Pretrained Transformers achieve state-of-the-art performance in various code-processing tasks but ma...
Adopting a two-stage paradigm of pretraining followed by fine-tuning, Pretrained Language Models (PL...