Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks. This greatly facilitates the application of natural language processing to low-resource languages. However, there remain languages on which existing multilingual models do not perform well. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Cantonese, and six other Chinese minority languages. To evaluate the cross-lingual ability of multilingual models on the minority languages, we collect documents from Wikipedia and build a text classification dataset, WCM (Wiki-Chinese-Minority). We test CINO on WCM a...
Offensive language detection is increasingly crucial for maintaining a civilized social media platfo...
Natural language processing (NLP) has a significant impact on society via technologies such as machi...
Scaling multilingual representation learning beyond the hundred most frequent languages is challengi...
It is hard to collect corpora used to train good language models for many minority languages. Canton...
With more than 1.3 billion people, China exhibits an array of diversity in many domains, including e...
English pretrained language models, which make up the backbone of many modern NLP systems, require h...
Realizing general-purpose language intelligence has been a longstanding goal for natural language pr...
Although there are increasing and significant ties between China and Portuguese-speaking countries, ...
Large-scale corpora play a vital role in the construction of large language models (LLMs). However, ...
Producing machine translation (MT) for the many minority languages in the world is a serious challen...
In this paper we share findings from our effort to build practical machine translation (MT) systems ...
[Extract] Do English-Chinese bilinguals process their languages differently from monolinguals? What ...
China is a very multicultural and multiethnic country with a huge number of languages and dialects w...
In this paper, we introduce a massively multilingual speech corpus with fine-grained phonemic trans...
To investigate the role of linguistic knowledge in data augmentation (DA) for Natural Language Proce...