The amount of data produced in user-generated content continues to grow at a staggering rate. However, the text found in these media can deviate wildly from the standard rules of orthography, syntax and even semantics and present significant problems to downstream applications which make use of this noisy data. In this paper we present a novel unsupervised method for extracting domain-specific lexical variants given a large volume of text. We demonstrate the utility of this method by applying it to normalize text messages found in the online social media service Twitter into their most likely standard English versions. Our method yields a 20% reduction in word error rate over an existing state-of-the-art approach.
Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and oth...
Social media platforms such as Twitter have grown at a tremendous pace in recent years and have beco...
The informal nature of social media text renders it very difficult to be automatically processed by...
User-generated content has become a recurrent resource for NLP tools and applications, hence many ...
While parsing performance on in-domain text has developed steadily in recent years, out-of-domain te...
We propose an unsupervised method for automatically calculating word usage similarity in social me...
One of the major problems in the era of big data use is how to ‘clean’ the vast amount of data on th...
This research focuses on text processing in the sphere of English-language social media. We introduc...
The impact of social media and SMS is increasing in our daily lives. These sources provide t...
One of the major challenges in the era of big data use is how to ‘clean’ the vast amount of data, pa...
We present a simple yet effective approach to adapt part-of-speech (POS) taggers to new domains. Our...
The ever-growing usage of social media platforms generates daily vast amounts of textual data which ...
One of the major challenges in the era of big data use is how to 'clean' the vast amount of data, pa...
Social media language contains huge amount and wide variety of nonstandard tokens, created both int...