Pretrained language models (PLMs) are today the primary model for natural language processing. Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible. While prior work has shown it possible to address this issue by learning a new embedding layer for the new language, doing so is both data- and compute-inefficient. We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages. Concretely, by resetting the embedding layer every K updates during pretraining, we encourage the PLM to improve its ability to learn new embeddings within a limited number of updates...
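The core mechanism described in this abstract, re-initializing the embedding layer every K optimizer updates while the rest of the network keeps training normally, can be sketched in a few lines. The snippet below is a minimal illustration of that idea, not the authors' implementation; the model, data, and hyperparameter names (TinyLM, pretrain_with_active_forgetting, k_reset) are assumptions made for the example, and for simplicity the optimizer state associated with the embeddings is not reset.

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy language model: embedding layer + transformer body + LM head."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.lm_head(self.body(self.embed(tokens)))

def pretrain_with_active_forgetting(model, batches, k_reset=1000, lr=1e-3):
    """Standard pretraining loop, except the token embeddings are wiped
    every `k_reset` updates; the transformer body is never reset."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for step, (tokens, targets) in enumerate(batches, start=1):
        logits = model(tokens)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % k_reset == 0:
            # Active forgetting: re-initialize the embedding layer so the body
            # is pushed toward representations that work with fresh embeddings.
            model.embed.reset_parameters()

The intended effect is that, at adaptation time, only a new embedding layer for the target language needs to be learned, which the periodic resets during pretraining have trained the body to accommodate quickly.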
The lifelong learning paradigm in machine learning is an attractive alternative to the more prominen...
Pre-training language models (LMs) on large-scale unlabeled text data makes the model much easier to...
We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the t...
This paper considers continual learning of large-scale pretrained neural machine translation model w...
Recent advances in NLP are brought by a range of large-scale pretrained language models (PLMs). Thes...
Pretrained language models (PTLMs) are typically learned over a large, static corpus and further fin...
Prior work shows that it is possible to expand pretrained Masked Language Models (MLMs) to new langu...
Pretraining multilingual language models from scratch requires considerable computational resources ...
Multilingual speech recognition with neural networks is often implemented with batch-learning, when ...
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the pr...
Pre-trained language models (PLMs) have demonstrated impressive performance across various downstrea...
Continual pretraining is a standard way of building a domain-specific pretrained language model from...
In real scenarios, a multilingual model trained to solve NLP tasks on a set of languages can be requ...
GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various n...
The reusability of state-of-the-art Pre-trained Language Models (PLMs) is often limited by their gen...