The TDX Thesis Spanish Corpus is a 246-million-token corpus of clean Spanish text extracted from scientific theses hosted on the domain tdx.cat, which contains open theses published by Catalan universities. The corpus has been preprocessed and deduplicated using the Corpus-Cleaner pipeline. It consists of 248,676,517 tokens, 8,156,059 sentences and 9,790 documents. Documents are separated by single new lines. We license the actual packaging of these data under an Attribution 4.0 International License. Copyright by Secretaría de Estado de Digitalización e Inteligencia Artificial (SEDIA) (2022). Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
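Since documents are separated by single new lines, the corpus can be streamed one document at a time. The following is a minimal sketch, assuming that each non-empty line holds exactly one document and that the file is UTF-8 encoded; the function names and the whitespace tokenization are illustrative assumptions, not part of the corpus distribution, so the token count it produces will not necessarily match the published figure.

```python
from typing import Dict, Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document per non-empty line (assumed corpus layout)."""
    for line in lines:
        doc = line.strip()
        if doc:
            yield doc


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count documents and whitespace-delimited tokens.

    Whitespace splitting is a stand-in tokenizer (an assumption);
    the tokenizer used for the published counts is not stated.
    """
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(lines):
        n_docs += 1
        n_tokens += len(doc.split())
    return {"documents": n_docs, "tokens": n_tokens}


if __name__ == "__main__":
    # Tiny in-memory stand-in for the corpus file.
    sample = [
        "Primera tesis de ejemplo.\n",
        "Segunda tesis, algo más larga.\n",
    ]
    print(corpus_stats(sample))
```

To run it on the real data, pass an open file handle instead of the sample list, for example `open(path, encoding="utf-8")` with `path` pointing at wherever the downloaded corpus file is stored. Because the file is iterated line by line, the whole corpus never has to fit in memory.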
[EN] CorpusNet is a hub of bilingual and multilingual corpora and related resources featuring any of...
This article describes the use of an English-Spanish Parallel Corpus of Science and Technology Texts...
http://www.hispanicseminary.org/t&c/cro/index-en.htm "The Spanish Chronicle Texts corpus, a free onl...
The TDX Thesis Spanish Corpus is a 246-million-token corpus of Spanish clean text extracted from sci...
The CSIC Spanish corpus is a 146-million-token corpus of Spanish scientific magazines from the revis...
The InfoLibros corpus is a 218-million-token corpus of Spanish narratives extracted from free books ...
The Catalan Newswire Corpus is a 163-million-token corpus of Catalan newswire text built from three ...
The Padicat (Patrimoni Digital de Catalunya) Corpus is a 111-million-token corpus of crawled Catalan...
The Catalan Textual Corpus is a 1760-million-token web corpus of Catalan built from several sources:...
Description The largest Spanish biomedical and health corpus to date gathered from a massive Spanish...
BasCrawl is a 186-million-token web corpus of Basque obtained by crawling over 12000 domains. We inc...
The corpus consists of a number of specialized texts (Law, Economics, Medicine, Environment and Comp...
The Catalan Government Crawling Corpus is a 39-million-token web corpus of Catalan built from the we...
DESCRIPTION: ACTIV-ES is a comparable Spanish corpus comprised of film dialogue from Argentine, Mexi...
Spanish text-corpus extracted from Wikipedia, using the platform described on Cadavid Rengifo, Hécto...