In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require vast amounts of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, their Spanish portions have important shortcomings: they are either too small in comparison with other languages, or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of qu...
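To illustrate the kind of processing such a corpus implies, the sketch below keeps Spanish paragraphs from a single Common Crawl WET file and discards exact duplicates by hash. It is a minimal, assumption-laden example, not the esCorpius pipeline described in the paper: the file name CC-MAIN-example.warc.wet.gz is hypothetical, warcio is used for WARC/WET parsing, and langdetect stands in for whatever language identifier a production pipeline would use.

import hashlib

from langdetect import detect, LangDetectException   # stand-in language identifier
from warcio.archiveiterator import ArchiveIterator    # WARC/WET parsing

WET_PATH = "CC-MAIN-example.warc.wet.gz"   # hypothetical local file name
seen = set()                               # SHA-1 digests of paragraphs already kept

with open(WET_PATH, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":           # WET text records only
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="ignore")
        for paragraph in (p.strip() for p in text.split("\n")):
            if len(paragraph) < 100:                  # crude length filter
                continue
            try:
                if detect(paragraph) != "es":         # keep Spanish only
                    continue
            except LangDetectException:
                continue
            digest = hashlib.sha1(paragraph.encode("utf-8")).hexdigest()
            if digest in seen:                        # drop exact duplicates
                continue
            seen.add(digest)
            print(url, paragraph[:80])

Exact-hash deduplication is only the simplest option; at Common Crawl scale, near-duplicate detection (e.g. shingling or MinHash) is usually needed as well.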