Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a mul...
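The abstract above describes two ideas: de-emphasizing token sharing between languages with little lexical overlap, and assigning vocabulary capacity per language to ensure coverage. Below is a toy sketch of those two ideas only; the grouping criterion (Jaccard overlap of token sets), the greedy merge, and the proportional budget split are illustrative assumptions, not the paper's actual algorithm.

```python
# Toy sketch of the two ideas in the XLM-V abstract above:
# (1) group languages whose token sets overlap, so only they share tokens,
# (2) split a global vocabulary budget across those groups.
# All names and thresholds are illustrative, not the paper's method.

def lexical_overlap(vocab_a: set, vocab_b: set) -> float:
    """Jaccard similarity between two per-language token sets."""
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

def group_languages(vocabs: dict, threshold: float = 0.3) -> list:
    """Greedily merge languages whose token sets overlap above threshold
    with the first language in an existing group."""
    groups = []
    for lang, vocab in vocabs.items():
        for group in groups:
            if lexical_overlap(vocab, vocabs[group[0]]) >= threshold:
                group.append(lang)
                break
        else:
            groups.append([lang])
    return groups

def allocate_capacity(groups: list, vocabs: dict, budget: int) -> dict:
    """Split the total vocabulary budget across groups, proportional
    to the size of each group's combined token set."""
    union_sizes = {tuple(g): len(set().union(*(vocabs[l] for l in g)))
                   for g in groups}
    total = sum(union_sizes.values())
    return {g: budget * s // total for g, s in union_sizes.items()}

# Toy example: two Romance languages share tokens, Japanese does not,
# so es/pt share one slice of the budget and ja gets its own.
vocabs = {
    "es": {"el", "la", "de", "casa", "nacional"},
    "pt": {"o", "a", "de", "casa", "nacional"},
    "ja": {"これ", "は", "です", "家"},
}
groups = group_languages(vocabs)
capacity = allocate_capacity(groups, vocabs, budget=100)
```

In this sketch, es and pt land in one group (their token sets overlap well above the threshold) while ja forms its own, and the budget is divided roughly 7:4 between the two groups' combined token inventories.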
Although multilingual pretrained models (mPLMs) enabled support of various natural language processi...
Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing...
Large language models (LLMs)—machine learning algorithms that can recognize, summarize, translate,...
Multilingual language models are widely used to extend NLP systems to low-resource languages. Howeve...
Recent benchmarks for Large Language Models (LLMs) have mostly focused on application-driven tasks s...
Large Language Models (LLMs), trained predominantly on extensive English data, often exhibit limitat...
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., makin...
Pretrained multilingual language models have become a common tool in transferring NLP capabilities t...
While pretrained language models (PLMs) primarily serve as general purpose text encoders that can be...
The recent success of LLMs has been predominantly driven by curating the training dataset compositio...
Scaling language models with more data, compute and parameters has driven significant progress in na...
Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuris...
How cross-linguistically applicable are NLP models, specifically language models? A fair comparison ...
Transfer learning based on pretraining language models on a large amount of ra...
Current multilingual vision-language models either require a large number of additional parameters f...