We present research aimed at building tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires revisiting the initial steps of Natural Language Processing, since UGC (micro-blogs, blogs, and Web 2.0 user-generated text in general) presents a number of nonstandard communicative and linguistic characteristics and is often closer to oral and colloquial language than to edited text. We present a corpus of Spanish UGC text from three sources (Twitter, consumer reviews, and blogs) and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipel...
User-generated content has become a recurrent resource for NLP tools and applications, hence many ...
User-generated content published on microblogging social networks constitutes a priceless source of ...
Language model-based pre-trained representations have become ubiquitous in nat...
We present work in progress aiming to build tools for the normalization of User-Generated Content (U...
User-generated content (UGC) represents an important source of information for governments, companie...
The language used in social media is often characterized by the abundance of informal and non-standa...
This paper presents a system to normalize Spanish tweets, which uses preprocessing rules, a domain-a...
One of the most persistent characteristics of written user-generated content (UGC) is the use of non...
The boom of natural language processing (NLP) is taking place in a world where more...
User-Generated Content (UGC; in Portuguese, Conteúdo Gerado por Usuário, CGU) is the name given to content created spontaneously by ...
In this article we describe the microtext normalization system we have used to participate in the N...
This is an extended abstract of the presentation. [Abstract] User-generated content published on microblogging...