Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs, forums and news comments. The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, and word normalisation of non-standard Slovene. It is composed of three parts. The first is Janes-Norm 1.2 proper (5,000 texts and 93,000 words, texts up to 2016), which has automatically assigned lemmas and morphosyntactic tags. The other two parts constitute the complete Janes-Tag 3.0 corpus (http://hdl.handle.net/11356/1732), which has manually annotated morphosyntactic tags, lemmas and named entities (15,...
ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-sta...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standa...
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is me...
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is me...
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is me...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard trai...
Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to...
ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-stan...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
Janes-Vejica is a corpus of Slovene tweets where commas are annotated with the reason for their (in)...