The SETimes.HBS dataset consists of parallel documents written in Bosnian, Croatian and Serbian, harvested from the now-inactive setimes.com website, which published news in the languages of South-Eastern Europe. How the documents were written is not known, but they are quite likely independent translations from English. The dataset is intended mainly for discriminating between closely related languages. It is not a traditional parallel dataset, as there are no explicit links between parallel documents. Special care was taken to ensure that the training, development and testing bins of the dataset contain the same documents in all three languages, because data leakage between the three bins, given the similarity of the three languages, could ...
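The leakage concern in the SETimes.HBS description can be illustrated with a small sketch of a split keyed on the document rather than on the individual text. This is not the procedure used to build the dataset: the description says there are no explicit links between parallel documents, so the `doc_id` field, the `assign_bin` and `split_by_document` helpers, and the 10% development and test shares are all assumptions made for illustration.

```python
import hashlib


def assign_bin(doc_id: str, dev_share: float = 0.1, test_share: float = 0.1) -> str:
    """Map a (hypothetical) shared document identifier to a bin.

    The bin depends only on doc_id, so every language version of the
    same document ends up in the same bin.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_share:
        return "test"
    if bucket < test_share + dev_share:
        return "dev"
    return "train"


def split_by_document(records):
    """Group records into train/dev/test by their assumed 'doc_id' key."""
    bins = {"train": [], "dev": [], "test": []}
    for rec in records:
        bins[assign_bin(rec["doc_id"])].append(rec)
    return bins
```

Because the assignment is a pure function of the identifier, the Bosnian, Croatian and Serbian versions of one document can never be separated across bins, which is the property the description is concerned with.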
The KAS-biterm bilingual term extraction dataset contains complete sentences selected from PhD these...
The dataset comprises 36570 examples of student writing from Slovenian primary and secondary schools...
A comprehensive corpus of user comments on online news articles on the topic of language from major ...
The Twitter-HBS dataset consists of Twitter users, their tweets, and the label of their predominantl...
A comprehensive corpus of news articles on the topic of language, published in major daily newspaper...
This paper explores the differences between three Slavic languages: Bosnian, Croatian and Serbian, d...
Machine translation between closely related languages is less challenging and exhibits a smaller num...
A comprehensive corpus of user comments on online news articles on the topic of language from major ...
A comprehensive corpus of news articles on the topic of language, published in major Montenegrin dai...
The FRENK dataset consists of comments on Facebook posts (news articles) of mainstream media outlets...
The META-NET research on language technologies in 2012 showed weak support for tools for crossing t...
The best way to improve a statistical machine translation system is to identify concrete problems ca...
This paper describes the ADAPT-DCU machine translation systems built for the WMT 2020 shared task on...
The Machine Translation datasets KAS-MT 1.0 contain automatically sentence-aligned Slovene and Engli...
written; domain-specific (newspaper); synchronic; bilingual; parallel; unidirectional; XML; S-alignm...