Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules are language-specific. Like an elephant in the living room, it is a problem that is impossible to overlook whenever new raw datasets need to be processed or when tokenization conventions are reconsidered. It is, moreover, an important problem, because any errors occurring early in the pipeline affect further analysis negatively. We believe that, regarding tokenization, there is still room for improvement, in particular on the methodological side of the task. We are particularly interested in the following questions: Can we use supervised learning to avoid hand-crafting rules...
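The question this abstract raises, whether supervised learning can replace hand-crafted rules, is typically answered by recasting tokenization and sentence segmentation as character-level sequence labeling. Below is a minimal, self-contained Python sketch of that formulation. The S/T/I/O label scheme, the window-based backoff classifier, and the toy data are illustrative assumptions for this note, not the model of any paper listed here; published systems typically use CRFs or neural taggers over richer features.

from collections import Counter, defaultdict

def label_characters(text, token_spans, sentence_starts):
    # Derive gold character labels from token offsets:
    #   S = token start that also opens a sentence, T = other token start,
    #   I = token-internal character, O = whitespace / untokenized character.
    labels = ["O"] * len(text)
    for start, end in token_spans:
        labels[start] = "S" if start in sentence_starts else "T"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

class WindowTagger:
    # Predicts each character's label from the surrounding character window,
    # backing off to narrower windows and finally to the majority label.
    def __init__(self, radius=2):
        self.radius = radius
        self.tables = [defaultdict(Counter) for _ in range(radius + 1)]
        self.majority = Counter()

    def _context(self, text, i, r):
        padded = "\x00" * r + text + "\x00" * r
        return padded[i:i + 2 * r + 1]

    def train(self, text, labels):
        for i, lab in enumerate(labels):
            self.majority[lab] += 1
            for r in range(self.radius + 1):
                self.tables[r][self._context(text, i, r)][lab] += 1

    def predict(self, text):
        out = []
        for i in range(len(text)):
            for r in range(self.radius, -1, -1):
                votes = self.tables[r].get(self._context(text, i, r))
                if votes:
                    out.append(votes.most_common(1)[0][0])
                    break
            else:  # context never seen at any window size
                out.append(self.majority.most_common(1)[0][0])
        return out

# Toy demonstration: train on one hand-segmented string, tag new text.
text = "Hi. It works."
spans = [(0, 2), (2, 3), (4, 6), (7, 12), (12, 13)]  # Hi | . | It | works | .
tagger = WindowTagger()
tagger.train(text, label_characters(text, spans, sentence_starts={0, 4}))
print("".join(tagger.predict("It works.")))  # one label per character

Token and sentence boundaries are then read off the predicted label string: a new token starts at every S or T, and a new sentence at every S.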
Current taggers assume that input texts are already tokenized, i.e., correctly segmented into tokens or...
Detecting the sentence boundary is one of the crucial pre-processing steps in natural language proce...
This paper presents a model of language processing where word segmentation is an integral part of se...
In this work we address the problems of sentence segmentation and tokenization. Informally the task ...
Tokenization and segmentation are steps performed in the earlier stages of most text analysis. It is...
Can attention- or gradient-based visualization techniques be used to infer token-level labels for bi...
Fast re-training of word segmentation models is required for adapting to new resources or domains in...
Learning to construct text representations in end-to-end systems can be difficult, as natural langua...
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2...
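For the entries that train tokenizers from Universal Dependencies treebanks, the key preprocessing step is recovering (raw text, boundary label) pairs from CoNLL-U files. Here is a minimal sketch under the standard CoNLL-U conventions: the surface string is rebuilt from the FORM column plus the SpaceAfter=No flag in MISC, and characters are labeled with the same S/T/I/O scheme as in the sketch above. Multiword-token lines (range IDs such as 1-2) and empty nodes (decimal IDs) are skipped for brevity, though a real trainer must handle them.

def conllu_to_examples(conllu_text):
    # One example per sentence block: (raw_text, one label per character).
    examples = []
    for block in conllu_text.strip().split("\n\n"):
        text_parts, labels = [], []
        sentence_initial = True
        for line in block.splitlines():
            cols = line.split("\t")
            if line.startswith("#") or len(cols) != 10:
                continue
            tok_id, form, misc = cols[0], cols[1], cols[9]
            if "-" in tok_id or "." in tok_id:  # multiword token / empty node
                continue
            labels.append("S" if sentence_initial else "T")
            labels.extend("I" * (len(form) - 1))
            text_parts.append(form)
            sentence_initial = False
            if "SpaceAfter=No" not in misc:
                text_parts.append(" ")
                labels.append("O")
        if text_parts and text_parts[-1] == " ":  # drop space after last token
            text_parts.pop()
            labels.pop()
        examples.append(("".join(text_parts), "".join(labels)))
    return examples

sample = (
    "# text = Hello, world!\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "3\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\tSpaceAfter=No\n"
    "4\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)
for text, labels in conllu_to_examples(sample):
    print(text)    # Hello, world!
    print(labels)  # SIIIITOTIIIIT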
Among the most common operations in language processing are segmentation and labelling [7]. Chunki...
In this thesis, we present a data-driven system for disambiguating token and sentence boundaries. Th...