One of the most persistent characteristics of written user-generated content (UGC) is the use of non-standard words, which makes UGC harder to process and analyze automatically. Text normalization is the task of transforming lexical variants into their canonical forms. It is often used as a pre-processing step for conventional NLP tasks to counter the performance drop that NLP systems experience when applied to UGC. In this work, we follow a Neural Machine Translation approach to text normalization. Training such an encoder-decoder model requires large parallel corpora of sentence pairs. However, obtaining large data sets of UGC together with their normalized versions is not trivial, espe...
We present research aiming to build tools for the normalization of User-Generated Content (UGC). We ...
The creation of text corpora requires a sequence of processing steps in order ...
Existing natural language processing systems have often been designed with standard texts in mind. H...
One of the main characteristics of social media data is the use of non-standard language. Since NLP ...
Social media texts have become one of the most used forms of written language and a valuable source ...
In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluatin...
Language model-based pre-trained representations have become ubiquitous in nat...
This paper describes a phrase-based machine translation approach to normalize Dutch user-ge...
User-generated content (UGC) represents an important source of information for governments, companie...
Text normalization is the task of mapping non-canonical language, typical of speech transcription an...
We present work in progress aiming to build tools for the normalization of User-Generated Content (U...
Text normalization is the task of mapping noncanonical language, typical of speech transcription and...
In this work we present a taxonomy of error categories for lexical normalization, which is the task ...
As social media constitute a valuable source for data analysis for a wide range of applications, the...