The Corpus of Academic Slovene (KAS) contains Slovene BSc/BA, MSc/MA, and PhD theses from 2000 to 2018. We present a cleaner version of the corpus with added text segmentation and updated POS tagging. The updated corpus of abstracts contains fewer artefacts. Using machine learning classifiers, we filled in missing research field information in the metadata. We used the full texts and corresponding abstracts to create several new datasets: monolingual and cross-lingual datasets for long-text summarization of academic texts, and a dataset of aligned sentences from English and Slovene abstracts, suitable for machine translation. We release the corpora, datasets, and developed source code under a permissive licence.
MAKS (MlAdinski KorpuS, i.e. the Youth Corpus) includes texts from literature, newspapers, and, to a...
The KUUS corpus comprises 17 textbooks and 7 workbooks (over 700,000 words) for Slovenian as a secon...
In the last decade, corpus linguistics has finally established itself as a separate research startin...
The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600...
The KAS-dr corpus of Slovene PhD theses consists of almost 1,600 texts (266 thousand pages or 100 mi...
The KAS-mag corpus of Slovene MSc/MA theses consists of almost 16,000 texts (1,360 thousand pages or...
The KAS-dipl corpus of Slovene BSc/BA theses consists of almost 65,000 texts (3,5 million pages or 1...
The KAS-abs corpus contains 108,254 automatically identified Slovenian and/or English abstracts (30 ...
Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.ne...
The Machine Translation datasets KAS-MT 1.0 contain automatically sentence-aligned Slovene and Engli...
The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 8,3...
The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 6,3...
The KAS-biterm bilingual term extraction dataset contains complete sentences selected from PhD these...
The JOS morphosyntactic resources for Slovene consist of the specifications, lexicon, and two corpor...