The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (MSDs) and lemmas, with about one fourth of the more problematic annotations hand-validated. The MSDs are given both in the JOS/MULTEXT-East framework (http://nl.ijs.si/ME/V6/msd/) and in the framework of Universal Dependencies for Slovene (https://universaldependencies.org/treebanks/sl_ssj/index.html). The corpus is available in source TEI XML, with the MSDs in English or Slovene, in the derived vertical format used by the CQP and (no)Sketch Engine concordancers, and in CONLL...
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consistin...
Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard trai...
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is me...
The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is mean...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
The JOS morphosyntactic resources for Slovene consist of the specifications, lexicon, and two corpor...
The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenization...
The ssj500k training corpus is based on two training corpora built within the JOS project (http://nl...
The SETimes.SR training corpus contains 86 726 tokens manually annotated on the levels of tokenisati...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenis...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...