Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training for downstream tasks. We introduce Learner modules and priming, novel methods for fine-tuning that exploit the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that learners perform on par w...
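The Learner-module architecture itself is not spelled out in the excerpt above; as a rough illustration of the subset-fine-tuning idea it describes, the sketch below freezes a DistilBERT backbone and trains only a small task head. The checkpoint name, choice of trainable parameters, and learning rate are placeholders for illustration, not the paper's method.

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Illustrative subset fine-tuning: freeze the DistilBERT backbone and
# update only the task head (not the paper's actual Learner modules).
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("pre_classifier", "classifier"))

# Report how small the trainable subset is relative to the full model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")

# Only the trainable subset is handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)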
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown e...
Parameter-efficient tuning (PET) has been widely explored in recent years because it tunes much fewe...
With the dramatically increased number of parameters in language models, sparsity methods have recei...
We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset ...
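A minimal sketch of the bias-only (BitFit-style) fine-tuning described above, assuming a PyTorch/Hugging Face setup; the checkpoint, task head, and learning rate are illustrative choices rather than the paper's exact configuration.

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# BitFit-style sparse fine-tuning: only bias terms (plus the freshly
# initialized classification head) stay trainable; all other weights are frozen.
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)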
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural La...
Gigantic pre-trained models have become central to natural language processing (NLP), serving as the...
Large pre-trained language models have recently gained significant traction due to their improved pe...
There is growing interest in adapting large-scale language models using parameter-efficient fine-t...
Language model fine-tuning is essential for modern natural language processing, but is computational...
In this paper, we move towards combining large parametric models with non-parametric prototypical ne...
Parameter-shared pre-trained language models (PLMs) have emerged as a successful approach in resourc...
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnabl...
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fin...
Transformer-based language models have become a key building block for natural language processing. ...
The growing size of neural language models has led to increased attention to model compression. The ...