National audience
The boom in natural language processing (NLP) is taking place in a world where more and more content is produced online. On social networks especially, textual content published by users is full of “non-standard” phenomena such as spelling mistakes, jargon, marks of expressiveness, etc. As a result, NLP models, which are largely trained on “standard” data, suffer a decline in performance when applied to user-generated content (UGC). One approach to mitigating this degradation is lexical normalisation, in which non-standard words are replaced by their standard forms. In this paper, we review the state of the art of lexical normalisation of UGC and run a preliminary experimental study to show the advantages and difficulti...
As social media constitute a valuable source for data analysis for a wide range of applications, the...
Lexical normalization is the task of transforming an utterance into its standardized form. This task...
One of the most persistent characteristics of written user-generated content (UGC) is the use of non...
The information contained in messages posted on the Internet (forums, social networks, review sites....
In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluatin...
Existing natural language processing systems have often been designed with standard texts in mind. H...
The work reported in this paper consisted in the creation of an automatic normalization tool for non...
Lexical normalization is the task of transforming an utterance into its standardized form. This task...
User-generated content (UGC) represents an important source of information for governments, companie...
We present research aiming to build tools for the normalization of User-Generated Content (UGC). We ...
The ever-growing usage of social media platforms generates vast amounts of textual data every day, which ...
We present work in progress aiming to build tools for the normalization of User-Generated Content (U...
In this work we present a taxonomy of error categories for lexical normalization, which is the task ...
We present work in progress aiming to build tools for the normalization of User-Generated Content (U...
Text normalization is an indispensable stage for natural language processing of social media data wi...