The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens) written between 2000 and 2018 and gathered from the digital libraries of Slovene higher education institutions via the Slovene Open Science portal (http://openscience.si/). The theses come with significant metadata, and each thesis in the corpus contains only its textual body, i.e. without its front and back matter. The body is divided into chapters, then into pages, then into paragraphs, and finally into sentences. The sentence tokens are tagged with morphosyntactic descriptions (detailed part-of-speech tags) and the words are lemmatised. As opposed to the prev...
MAKS (MlAdinski KorpuS, i.e. the Youth Corpus) includes texts from literature, newspapers, and, to a...
The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative ...
The ccGigafida corpus consists of paragraph samples from 31,722 documents, each containing information a...
The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600...
The KAS-dr corpus of Slovene PhD theses consists of almost 1,600 texts (266 thousand pages or 100 mi...
The KAS-mag corpus of Slovene MSc/MA theses consists of almost 16,000 texts (1,360 thousand pages or...
The KAS-dipl corpus of Slovene BSc/BA theses consists of almost 65,000 texts (3.5 million pages or 1...
The Corpus of Academic Slovene (KAS) contains Slovene BSc/BA, MSc/MA, and PhD theses from 2000 to 2018. W...
Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.ne...
The KAS-abs corpus contains 108,254 automatically identified Slovenian and/or English abstracts (30 ...
The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 8,3...
The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 6,3...
The KUUS corpus comprises 17 textbooks and 7 workbooks (over 700,000 words) for Slovenian as a secon...
The Machine Translation datasets KAS-MT 1.0 contain automatically sentence-aligned Slovene and Engli...
Gigafida 2.0, with about 1.1 billion words, is a reference corpus of written Slovene text published ...