Automatic Language Identification (LI) is a widely addressed task, but not all users (for example, linguists) have the means or interest to develop their own tool or to train existing ones on their own data. There are several off-the-shelf LI tools, but for some languages it is unclear which tool performs best on specific types of text. This article compares the performance of several off-the-shelf LI tools on Bulgarian social media data. The tools are tested on a multilingual Twitter dataset of 2966 tweets and on an existing Bulgarian Twitter dataset of 3350 tweets on the topic of fake content detection. The article presents the manual annotation procedure of the first dataset, a discu...
Offering access to information in microblog posts requires successful language identification. Langu...
Automatically analyzing and extracting useful information from noisy social media content are curren...
Investigations of language use in multilingual regions are traditionally done through the usage of r...
Multilingual speakers communicate in more than one language in daily life and on social media. In or...
We present an evaluation of “off-the-shelf” language identification systems as applied to microblog...
Multilingual posts can potentially affect the outcomes of content analysis on microblog platforms. T...
Automatically detecting disinformation is an important Natural Language Processing (NLP) task whose ...
The cross-disciplinary Nordic Tweet Stream (NTS) is a project aiming at creating a multilingual text...
The Twitter-HBS dataset consists of Twitter users, their tweets, and the label of their predominantl...
The world is growing more connected through the use of online communication, exposing software and h...
This dataset has been created within Project TRACES (more information: https://traces.gate-ai.eu/). ...
The microblogging service Twitter provides vast amounts of user-generated language data. In this art...
Native Language Identification is one of the growing subfields in Natural Language Processing (NLP)....