The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, named entities, and, partially, syntactic dependencies. The corpus uses the MULTEXT-East / JOS morphosyntactic tagset and the JOS dependency schema, and is based on the jos100k and jos1M corpora. Note that this entry updates ssj500k 1.3 by fixing many annotation errors.
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consistin...
Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard trai...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
The ssj500k training corpus is based on two training corpora built within the JOS project (http://nl...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is mean...
The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenis...
The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is mean...
The SETimes.SR training corpus contains 86 726 tokens manually annotated on the levels of tokenisati...
The JOS morphosyntactic resources for Slovene consist of the specifications, lexicon, and two corpor...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...