The Twitter-HBS dataset consists of Twitter users, their tweets, and a label for each user's predominantly used language: Bosnian, Croatian, Montenegrin, or Serbian. Because the label encodes only the predominant language of a user, the data also contains tweets in other languages (mainly English). The dataset is intended primarily for discriminating between closely related languages at the level of a Twitter user rather than a single tweet. The only pre-processing applied to the tweet texts is transliteration from the Cyrillic into the Latin script, so that the dataset measures the quality of user classification regardless of the script used.
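As an illustration of the transliteration step, here is a minimal Python sketch that maps Serbian Cyrillic letters to their standard Latin equivalents. It is not the script used to build the dataset: the `CYR_TO_LAT` table and the `transliterate` function are hypothetical names, and the title-case handling of the digraphs (Љ to Lj, Њ to Nj, Џ to Dž) is a simplifying assumption that would not produce LJ, NJ, DŽ inside all-caps words.

```python
# Hypothetical sketch of Cyrillic-to-Latin transliteration for Serbian text.
# The mapping follows the standard Serbian alphabet correspondence; it is an
# assumption about the scheme, not the dataset authors' actual tooling.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Add uppercase letters; digraphs become title-case (Љ -> Lj), which is a
# simplification for words written entirely in capitals.
CYR_TO_LAT.update(
    {cyr.upper(): lat.capitalize() for cyr, lat in list(CYR_TO_LAT.items())}
)


def transliterate(text: str) -> str:
    """Replace each Serbian Cyrillic letter with its Latin equivalent.

    Characters outside the mapping (Latin letters, digits, punctuation,
    emoji, URLs) are passed through unchanged.
    """
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(transliterate("Љубав и џем"))
```

Because unmapped characters pass through untouched, the same function can be applied to every tweet regardless of its script, so tweets that are already in the Latin script stay as they are.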
ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standar...
ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standa...
ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standar...
In this paper we investigate the phenomenon of linguistic accommodation among Serbian Twitter users ...
The SETimes.HBS dataset consists of parallel documents written in Bosnian, Croatian and Serbian, har...
In this paper we discuss the parallel manual normalisation of samples extracted from Croatian and Se...
This is a Twitter dataset for code-mixed language identification. The dataset contains mixed Indones...
Automatic Language Identification (LI) is a widely addressed task, but not all users (for example li...
The dataset represents the Twitter production in Slovenian in the period from 2018 until 2020. It co...
The dataset contains over 1.6 million tweets (tweet IDs), labeled with sentiment by human annotators...
A trilingual Latvian-Russian-English corpus of tweets is presented with an analysis of users, langua...
This paper reports on a corpus-based analysis of demonym mentions in the corpus of Slovene tweets. F...
In this paper we describe how Twitter is used in various languages. We observe notable differences b...
In this paper we describe how Twitter is used in various languages. We observe notable differences b...
ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standa...