The amount of data produced as user-generated content continues to grow at a staggering rate. However, the text found in these media can deviate wildly from the standard rules of orthography, syntax and even semantics, and present significant problems for downstream applications that make use of this noisy data. In this paper we present a novel unsupervised method for extracting domain-specific lexical variants from a large volume of text. We demonstrate the utility of this method by applying it to normalize text messages found in the online social media service Twitter into their most likely standard English versions. Our method yields a 20% reduction in word error rate over an existing state-of-the-art approach.
The informal nature of social media text renders it very difficult to be automatically processed by...
This research focuses on text processing in the sphere of English-language social media. We introduc...
Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and oth...
User-generated content has become a recurrent resource for NLP tools and applications, hence many ...
While parsing performance on in-domain text has developed steadily in recent years, out-of-domain te...
One of the major challenges in the era of big data use is how to ‘clean’ the vast amount of data, pa...
The ever-growing usage of social media platforms generates daily vast amounts of textual data which ...
One of the major problems in the era of big data use is how to ‘clean’ the vast amount of data on th...
We propose an unsupervised method for automatically calculating word usage similarity in social me...
Social media language contains a huge amount and wide variety of nonstandard tokens, created both int...
The impact of social media and SMS is increasing in our daily lives. These sources provide t...
We present a simple yet effective approach to adapt part-of-speech (POS) taggers to new domains. Our...
Social media platforms such as Twitter have grown at a tremendous pace in recent years and have beco...