We present work in progress on tools for the normalization of User-Generated Content (UGC). As we will see, the task requires revisiting the initial NLP processing steps, since UGC (micro-blogs, blogs and, more generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of Spanish UGC text from three sources: Twitter, consumer reviews and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatiz...
In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish, which we...
The writing style used in social media usually contains informal elements that can lower the perform...
User-generated texts on the web are freely available and lucrative sources of data for language tech...
We present research aiming to build tools for the normalization of User-Generated Content (UGC). We ...
The language used in social media is often characterized by the abundance of informal and non-standa...
User-generated content (UGC) represents an important source of information for governments, companie...
In this article we describe the microtext normalization system we have used to participate in the N...
User-generated content has become a recurrent resource for NLP tools and applications, hence many ...
The boom of natural language processing (NLP) is taking place in a world where more...
The ever-growing use of social media platforms generates vast amounts of textual data daily which ...
User-Generated Content (UGC; in Portuguese, Conteúdo Gerado por Usuário, CGU) is the name given to content created spontaneously by in...
As social media constitute a valuable source for data analysis for a wide range of applications, the...
This paper presents a system to normalize Spanish tweets, which uses preprocessing rules, a domain-a...