Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules are language-specific. Like an elephant in the living room, it is a problem that is impossible to overlook whenever new raw datasets need to be processed or when tokenization conventions are reconsidered. It is, moreover, an important problem, because any errors occurring early in the pipeline affect further analysis negatively. We believe that, regarding tokenization, there is still room for improvement, in particular on the methodological side of the task. We are particularly interested in the following questions: Can we use supervised learning to avoid hand-crafting rules...
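The question this abstract raises, whether supervised learning can replace hand-crafted rules, is typically answered by recasting tokenization and sentence segmentation as character-level sequence labeling. Below is a minimal, self-contained Python sketch of that formulation. The S/T/I/O label scheme, the window-based backoff classifier, and the toy data are illustrative assumptions for this note, not the model of any paper listed here; published systems typically use CRFs or neural taggers over richer features.

from collections import Counter, defaultdict

def label_characters(text, token_spans, sentence_starts):
    # Derive gold character labels from token offsets:
    #   S = token start that also opens a sentence, T = other token start,
    #   I = token-internal character, O = whitespace / untokenized character.
    labels = ["O"] * len(text)
    for start, end in token_spans:
        labels[start] = "S" if start in sentence_starts else "T"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

class WindowTagger:
    # Predicts each character's label from the surrounding character window,
    # backing off to narrower windows and finally to the majority label.
    def __init__(self, radius=2):
        self.radius = radius
        self.tables = [defaultdict(Counter) for _ in range(radius + 1)]
        self.majority = Counter()

    def _context(self, text, i, r):
        padded = "\x00" * r + text + "\x00" * r
        return padded[i:i + 2 * r + 1]

    def train(self, text, labels):
        for i, lab in enumerate(labels):
            self.majority[lab] += 1
            for r in range(self.radius + 1):
                self.tables[r][self._context(text, i, r)][lab] += 1

    def predict(self, text):
        out = []
        for i in range(len(text)):
            for r in range(self.radius, -1, -1):
                votes = self.tables[r].get(self._context(text, i, r))
                if votes:
                    out.append(votes.most_common(1)[0][0])
                    break
            else:  # context never seen at any window size
                out.append(self.majority.most_common(1)[0][0])
        return out

# Toy demonstration: train on one hand-segmented string, tag new text.
text = "Hi. It works."
spans = [(0, 2), (2, 3), (4, 6), (7, 12), (12, 13)]  # Hi | . | It | works | .
tagger = WindowTagger()
tagger.train(text, label_characters(text, spans, sentence_starts={0, 4}))
print("".join(tagger.predict("It works.")))  # one label per character

Token and sentence boundaries are then read off the predicted label string: a new token starts at every S or T, and a new sentence at every S.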
Current taggers assume that input texts are already tokenized, i.e., correctly segmented into tokens or...
Detecting the sentence boundary is one of the crucial pre-processing steps in natural language proce...
This paper presents a model of language processing where word segmentation is an integral part of se...
In this work we address the problems of sentence segmentation and tokenization. Informally the task ...
Tokenization and segmentation are steps performed in the earlier stages of most text analysis. It is...
Can attention- or gradient-based visualization techniques be used to infer token-level labels for bi...
Fast re-training of word segmentation models is required for adapting to new resources or domains in...
Learning to construct text representations in end-to-end systems can be difficult, as natural langua...
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2...
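For the entries that train tokenizers from Universal Dependencies treebanks, the key preprocessing step is recovering (raw text, boundary label) pairs from CoNLL-U files. Here is a minimal sketch under the standard CoNLL-U conventions: the surface string is rebuilt from the FORM column plus the SpaceAfter=No flag in MISC, and characters are labeled with the same S/T/I/O scheme as in the sketch above. Multiword-token lines (range IDs such as 1-2) and empty nodes (decimal IDs) are skipped for brevity, though a real trainer must handle them.

def conllu_to_examples(conllu_text):
    # One example per sentence block: (raw_text, one label per character).
    examples = []
    for block in conllu_text.strip().split("\n\n"):
        text_parts, labels = [], []
        sentence_initial = True
        for line in block.splitlines():
            cols = line.split("\t")
            if line.startswith("#") or len(cols) != 10:
                continue
            tok_id, form, misc = cols[0], cols[1], cols[9]
            if "-" in tok_id or "." in tok_id:  # multiword token / empty node
                continue
            labels.append("S" if sentence_initial else "T")
            labels.extend("I" * (len(form) - 1))
            text_parts.append(form)
            sentence_initial = False
            if "SpaceAfter=No" not in misc:
                text_parts.append(" ")
                labels.append("O")
        if text_parts and text_parts[-1] == " ":  # drop space after last token
            text_parts.pop()
            labels.pop()
        examples.append(("".join(text_parts), "".join(labels)))
    return examples

sample = (
    "# text = Hello, world!\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "3\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\tSpaceAfter=No\n"
    "4\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)
for text, labels in conllu_to_examples(sample):
    print(text)    # Hello, world!
    print(labels)  # SIIIITOTIIIIT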
Among the most common operations in language processing are segmentation and labelling [7]. Chunki...
In this thesis, we present a data-driven system for disambiguating token and sentence boundaries. Th...