We address the problem of normalizing user-generated content in a multilingual setting. Specifically, we target the comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh, in Russian, or in a mixture of both. Moreover, such comments are noisy, i.e., difficult to process due to mostly intentional breaches of spelling conventions, which aggravates the data sparseness problem. We therefore propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.
Lexical normalization is the task of transforming an utterance into its standardized form. This task...
In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluatin...
While parsing performance on in-domain text has developed steadily in recent years, out-of-domain te...
We present herein our work on language identification applied to comments left by the readers of onl...
The automatic analysis (parsing) of natural language is an important ingredient for many natural lan...
The boom of natural language processing (NLP) is taking place in a world where more...
As social media constitute a valuable source for data analysis for a wide range of applications, the...
We present research aiming to build tools for the normalization of User-Generated Content (UGC). We ...
This paper describes a phrase-based machine translation approach to normalize Dutch user-ge...
The ever-growing usage of social media platforms generates daily vast amounts of textual data which ...
User generated texts on the web are freely-available and lucrative sources of data for language tech...
Text normalization is an indispensable stage for natural language processing of social media data wi...
This article describes initial work into the automatic classification of user-generated content in n...
Text normalization is a necessity to correct and make more sense of the micro-...
One of the main characteristics of social media data is the use of non-standard language. Since NLP ...