The CSIC Spanish corpus is a 146-million-token corpus of Spanish scientific magazines from the revistas.csic.es/ repository. The corpus has been preprocessed and deduplicated using the Corpus-Cleaner pipeline. It consists of 146,795,650 tokens, 4,395,368 sentences and 30,929 documents. Documents are separated by single new lines. We license the actual packaging of these data under an Attribution 4.0 International License. Copyright by Secretaría de Estado de Digitalización e Inteligencia Artificial (SEDIA) (2022). Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
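Since the description states that documents are separated by single new lines, the corpus can plausibly be read one document per line. The following is a minimal sketch under that assumption, not an official loader: the function names `iter_documents` and `corpus_stats` are hypothetical, and the whitespace split is only a rough stand-in for tokenization, so its counts need not match the published token figure.

```python
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document per non-empty line (assumed layout: one document per line)."""
    for line in lines:
        doc = line.strip()
        if doc:  # skip blank lines
            yield doc


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (document count, whitespace-token count) for the given lines."""
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(lines):
        n_docs += 1
        # Assumption: whitespace splitting approximates tokens; the corpus
        # authors' tokenizer is not described in the record.
        n_tokens += len(doc.split())
    return n_docs, n_tokens


if __name__ == "__main__":
    sample = ["Primer documento de ejemplo.\n", "\n", "Segundo documento.\n"]
    print(corpus_stats(sample))
```

Any iterable of lines works, so an open file handle (for example one obtained with `open(path, encoding="utf-8")`, where `path` points to a local copy of the corpus) can be passed directly and is processed as a stream without loading the whole file into memory.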
[EN] The Iberia Corpus is a software computing tool first created between 2008 and 2009 within the “...
This article describes the use of an English-Spanish Parallel Corpus of Science and Technology Texts...
DESCRIPTION: ACTIV-ES is a comparable Spanish corpus comprised of film dialogue from Argentine, Mexi...
The CSIC Spanish corpus is a 146-million-token corpus of Spanish scientific magazines from the revis...
The InfoLibros corpus is a 218-million-token corpus of Spanish narratives extracted from free books ...
The TDX Thesis Spanish Corpus is a 246-million-token corpus of Spanish clean text extracted from sci...
The Catalan Newswire Corpus is a 163-million-token corpus of Catalan newswire text built from three ...
BasCrawl is a 186-million-token web corpus of Basque obtained by crawling over 12000 domains. We inc...
Description: The largest Spanish biomedical and health corpus to date gathered from a massive Spanish...
The Catalan Textual Corpus is a 1760-million-token web corpus of Catalan built from several sources:...
The Padicat (Patrimoni Digital de Catalunya) Corpus is a 111-million-token corpus of crawled Catalan...
The corpus consists of a number of specialized texts (Law, Economics, Medicine, Environment and Comp...
The Iberia Corpus is a software computing tool first created between 2008 and 2009 within the “Conse...
The Catalan Government Crawling Corpus is a 39-million-token web corpus of Catalan built from the we...
Abstract: Everyone working on general language would like their corpus to be bigger, wider-coverage, c...