In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at the document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in the CoNLL and TEI formats. We also describe the rather turbulent history of the resource and give insights into the topic and genre distribution of the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is mean...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenis...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokeni...
The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 20...
ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-sta...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is mea...
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is me...
A comprehensive corpus of user comments on online news articles on the topic of language from major ...
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is me...
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is me...