Masked language models conventionally use a masking rate of 15% due to the belief that more masking would provide insufficient context to learn good representations, and less masking would make training too expensive. Surprisingly, we find that masking up to 40% of input tokens can outperform the 15% baseline, and even masking 80% can preserve most of the performance, as measured by fine-tuning on downstream tasks. Increasing the masking rate has two distinct effects, which we investigate through careful ablations: (1) a larger proportion of input tokens are corrupted, reducing the context size and creating a harder task, and (2) models perform more predictions, which benefits training. We observe that larger models with more capacity to ta...
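As a rough illustration of what varying the masking rate means in practice, here is a minimal sketch of BERT-style token corruption with a configurable rate. It is not the authors' implementation: the function and argument names (`mask_tokens`, `mask_token_id`, `vocab_size`) are hypothetical, and the 80/10/10 corruption split among selected positions is the standard BERT recipe rather than a detail drawn from this abstract. It makes the trade-off above concrete: a higher `mask_rate` leaves fewer uncorrupted context tokens but yields more prediction targets per sequence.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_rate=0.4):
    """Corrupt `mask_rate` of the tokens; return (corrupted_ids, labels)."""
    labels = input_ids.clone()
    # Select positions to corrupt: each token independently with prob. mask_rate.
    corrupt = torch.bernoulli(torch.full(input_ids.shape, mask_rate)).bool()
    labels[~corrupt] = -100  # the MLM loss is computed only at corrupted positions

    corrupted = input_ids.clone()
    # Standard 80/10/10 recipe among corrupted positions:
    # 80% -> [MASK], 10% -> random token, 10% left unchanged.
    use_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & corrupt
    corrupted[use_mask] = mask_token_id
    use_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & corrupt & ~use_mask
    random_tokens = torch.randint(vocab_size, input_ids.shape, dtype=torch.long)
    corrupted[use_random] = random_tokens[use_random]
    return corrupted, labels
```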
Large pre-trained language models are successfully being used in a variety of tasks, across many lan...
Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural l...
Transformer-based autoregressive (AR) methods have achieved appealing performance for varied sequenc...
The current era of natural language processing (NLP) has been defined by the prominence of pre-train...
The reusability of state-of-the-art Pre-trained Language Models (PLMs) is often limited by their gen...
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretr...
Pre-training a language model and then fine-tuning it for downstream tasks has demonstrated state-of...
Word order, an essential property of natural languages, is injected in Transformer-based neural lang...
Masked Language Models (MLMs) have shown superior performance in numerous downstream Natural Langua...
Pre-trained language models (PTMs) have been shown to yield powerful text representations for dense pas...
Large language models (LMs) are able to in-context learn -- perform a new task via inference alone b...
We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset ...
A fundamental challenge of over-parameterized deep learning models is learning meaningful data repre...
Transformer-based masked language models trained on general corpora, such as BERT and RoBERTa, have ...
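The BitFit snippet above describes fine-tuning only a model's bias terms. As a minimal sketch of that idea in an assumed PyTorch setting, the helper below freezes every other parameter; the name `bias_only_parameters` is hypothetical and this is an approximation of the approach, not the authors' code.

```python
import torch

def bias_only_parameters(model: torch.nn.Module):
    """Freeze everything except bias terms and return the trainable subset."""
    trainable = []
    for name, param in model.named_parameters():
        if name.endswith("bias"):
            param.requires_grad = True   # bias terms remain trainable
            trainable.append(param)
        else:
            param.requires_grad = False  # weights, embeddings, etc. are frozen
    return trainable

# Usage sketch: optimize only the bias terms.
# optimizer = torch.optim.AdamW(bias_only_parameters(model), lr=1e-4)
```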