While parsing performance on in-domain text has improved steadily in recent years, out-of-domain and grammatically noisy text remain an obstacle and often lead to significant drops in parsing accuracy. In this thesis, we focus on parsing noisy content, such as user-generated text from services like Twitter. We investigate whether a preprocessing step based on machine translation techniques and unsupervised models for text normalization can improve parsing performance on noisy data. Existing data sets are evaluated and a new data set for dependency parsing of grammatically noisy Twitter data is introduced. We show that text normalization together with a combination of domain-specific and generic part-of-speech ...
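The pipeline described above, in which noisy input is normalized before part-of-speech tagging and parsing, can be illustrated with a deliberately simplified sketch. The normalization lexicon and example tweet below are hypothetical; the thesis relies on machine-translation-based and unsupervised normalization models rather than a hand-built lookup table.

```python
# Minimal sketch of lookup-based text normalization as a preprocessing step.
# NORMALIZATION_LEXICON is a hypothetical hand-built table; the thesis instead
# derives normalization candidates from MT techniques and unsupervised models.

NORMALIZATION_LEXICON = {
    "u": "you",
    "r": "are",
    "2moro": "tomorrow",
    "gr8": "great",
}

def normalize(tokens):
    """Replace known noisy token forms with canonical spellings."""
    return [NORMALIZATION_LEXICON.get(tok.lower(), tok) for tok in tokens]

if __name__ == "__main__":
    tweet = "u r gr8 , see you 2moro".split()
    print(normalize(tweet))
    # -> ['you', 'are', 'great', ',', 'see', 'you', 'tomorrow']
```

The normalized token sequence would then be passed to the part-of-speech taggers and the dependency parser in place of the raw tweet.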
User-generated texts on the web are freely available and lucrative sources of data for language tech...
For the purpose of POS tagging noisy user-generated text, should normalization...
We investigate the problem of parsing the noisy language of social media. We evaluate four Wall-Street-Journal-trained ...
We introduce the Denoised Web Treebank: a treebank including a normalization layer and a correspondi...
The amount of data produced in user-generated content continues to grow at a staggering rate. Howeve...
One of the major challenges in the era of big data use is how to 'clean' the vast amount of data, pa...
The automatic analysis (parsing) of natural language is an important ingredient for many natural lan...
One of the major problems in the era of big data use is how to ‘clean’ the vast amount of data on th...
Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and nois...
The ever-growing usage of social media platforms generates vast amounts of textual data daily, which ...
One of the main characteristics of social media data is the use of non-standard language. Since NLP ...
Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and oth...
We evaluate the statistical dependency parser, Malt, on a new dataset of sentences taken from tweets...
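Results on such tweet datasets are usually reported as unlabeled and labeled attachment scores (UAS/LAS). The following sketch shows a generic way to compute these metrics from gold-standard and system-output files in CoNLL-X format; it assumes 10 tab-separated columns with HEAD and DEPREL in columns 7 and 8, and the file names are placeholders rather than the evaluation setup of the cited work.

```python
# Generic UAS/LAS computation over CoNLL-X formatted gold and system files.
# Assumes 10 tab-separated columns per token line (HEAD = column 7,
# DEPREL = column 8, 1-based) and blank lines between sentences.

def read_arcs(path):
    """Yield (head, deprel) pairs for every token line in a CoNLL-X file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            yield cols[6], cols[7]  # HEAD, DEPREL

def attachment_scores(gold_path, pred_path):
    """Return (UAS, LAS) as fractions of tokens with correct heads / heads+labels."""
    gold = list(read_arcs(gold_path))
    pred = list(read_arcs(pred_path))
    assert len(gold) == len(pred), "token counts differ between gold and prediction"
    total = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / total
    las = sum(g == p for g, p in zip(gold, pred)) / total
    return uas, las

if __name__ == "__main__":
    # Placeholder file names for illustration only.
    uas, las = attachment_scores("gold.conll", "system.conll")
    print(f"UAS: {uas:.2%}  LAS: {las:.2%}")
```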