In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require vast amounts of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, their Spanish portions have important shortcomings: they are either too small in comparison with other languages, or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of qu...
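To illustrate the kind of processing such a corpus implies, the sketch below keeps Spanish paragraphs from a single Common Crawl WET file and discards exact duplicates by hash. It is a minimal, assumption-laden example, not the esCorpius pipeline described in the paper: the file name CC-MAIN-example.warc.wet.gz is hypothetical, warcio is used for WARC/WET parsing, and langdetect stands in for whatever language identifier a production pipeline would use.

import hashlib

from langdetect import detect, LangDetectException   # stand-in language identifier
from warcio.archiveiterator import ArchiveIterator    # WARC/WET parsing

WET_PATH = "CC-MAIN-example.warc.wet.gz"   # hypothetical local file name
seen = set()                               # SHA-1 digests of paragraphs already kept

with open(WET_PATH, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":           # WET text records only
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="ignore")
        for paragraph in (p.strip() for p in text.split("\n")):
            if len(paragraph) < 100:                  # crude length filter
                continue
            try:
                if detect(paragraph) != "es":         # keep Spanish only
                    continue
            except LangDetectException:
                continue
            digest = hashlib.sha1(paragraph.encode("utf-8")).hexdigest()
            if digest in seen:                        # drop exact duplicates
                continue
            seen.add(digest)
            print(url, paragraph[:80])

Exact-hash deduplication is only the simplest option; at Common Crawl scale, near-duplicate detection (e.g. shingling or MinHash) is usually needed as well.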