Tokenization and segmentation are performed in the early stages of most text analysis. They are normally fast processes carried out with deterministic algorithms [3] to establish token and sentence boundaries [5]. Inexact tokenization can negatively affect later processing and applications built on the corpus: applying a dependency parser to a badly tokenized sequence, for instance, yields errors beyond the span of the problematic token. Moreover, depending on the input (e.g. language, variety, register) and on the purpose of the study, different tokenization decisions may be expected; multiword expressions and ambiguous separators such as hashtags, for instance, can be approached in different ways. This study explores three widely used tokenizers - T...
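To make the point about ambiguous separators concrete, here is a minimal Python sketch with two invented regex rule sets (not any of the tokenizers compared in the study) that disagree on a hashtag:

```python
import re

def tokenize_a(text: str) -> list[str]:
    # Rule set A: every non-alphanumeric character is its own token,
    # so '#' is split off from the word that follows it.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_b(text: str) -> list[str]:
    # Rule set B: keep '#word' together, as a social-media-aware
    # tokenizer might; otherwise behave like rule set A.
    return re.findall(r"#\w+|\w+|[^\w\s]", text)

text = "New results on #COVID19 out today!"
print(tokenize_a(text))  # ['New', 'results', 'on', '#', 'COVID19', 'out', 'today', '!']
print(tokenize_b(text))  # ['New', 'results', 'on', '#COVID19', 'out', 'today', '!']
```

Neither output is wrong in itself; which one a downstream application needs depends on the language, register, and purpose of the study, which is exactly why such decisions vary across tokenizers.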
Corpus linguistic and language technological research needs empirical corpus data with nearly correc...
Corpora are often referred to as the ‘tools’ of corpus linguistics. However, it is important to reco...
Text mining is the process of extracting interesting and non-trivial knowledge or information from u...
When comparing different tools in the field of natural language processing (NLP), the quality of the...
Current taggers assume that input texts are already tokenized, i.e. correctly segmented into tokens or...
Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokeniz...
Among the most common operations in language processing are segmentation and labelling [7]. Chunki...
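As a concrete instance of the segmentation-and-labelling view, the sketch below decodes chunk spans from BIO labels, the standard encoding for chunking tasks; the function, tokens, and labels are invented for illustration and are not taken from the cited work:

```python
def bio_to_chunks(tokens: list[str], labels: list[str]):
    # Recover (type, span) chunks from BIO labels: 'B-X' begins a chunk
    # of type X, 'I-X' continues it, 'O' is outside any chunk.
    chunks, start, kind = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel flushes the last chunk
        if lab == "O" or lab.startswith("B-"):
            if start is not None:
                chunks.append((kind, tokens[start:i]))
                start, kind = None, None
        if lab.startswith("B-"):
            start, kind = i, lab[2:]
    return chunks

tokens = ["He", "reckons", "the", "current", "account", "deficit"]
labels = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]
print(bio_to_chunks(tokens, labels))
# [('NP', ['He']), ('VP', ['reckons']), ('NP', ['the', 'current', 'account', 'deficit'])]
```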
Tokenization is the process of splitting running texts into minimal meaningful units. In writing sys...
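As one illustration of what "minimal meaningful units" can mean in a whitespace-delimited writing system, here is a toy Python tokenizer that peels off punctuation and splits English clitics in the Penn Treebank style; the rules are invented for the example and are not the scheme of the cited work:

```python
import re

# Penn-Treebank-style clitic endings ("don't" -> "do" + "n't").
CLITIC = re.compile(r"(?i)(\w+)(n't|'re|'ve|'ll|'d|'s|'m)$")

def tokenize(text: str) -> list[str]:
    tokens = []
    for chunk in text.split():
        # Separate leading/trailing punctuation from the word core.
        lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", chunk).groups()
        tokens.extend(lead)
        m = CLITIC.match(core)
        if m:
            tokens.extend(m.groups())  # e.g. "Don't" -> "Do", "n't"
        elif core:
            tokens.append(core)
        tokens.extend(trail)
    return tokens

print(tokenize("Don't stop, she said."))
# ['Do', "n't", 'stop', ',', 'she', 'said', '.']
```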
In natural language processing (NLP), a crucial subsystem in a wide range of ...
What are the units of text that we want to model? From bytes to multi-word expressio...
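One point on that spectrum can be made concrete with byte-pair encoding (BPE), which learns subword units by repeatedly merging the most frequent adjacent symbol pair; the toy corpus and merge count below are invented for illustration:

```python
from collections import Counter

def learn_bpe(words: list[str], n_merges: int):
    # Start from characters; each merge fuses the most frequent
    # adjacent symbol pair across the corpus into one new symbol.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "low", "lower", "lowest"], 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')] -- 'low' becomes a single unit
```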
Statistical n-gram taggers like those of [Church 1988] or [Foster 1991] assign a part-of-speech label...
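The core of such a model fits in a few lines. The sketch below estimates tag-bigram and word-emission probabilities by relative frequency and tags greedily left to right; real systems decode the whole sequence (e.g. with dynamic programming), and the toy corpus here is invented:

```python
from collections import defaultdict

training = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
            [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

trans = defaultdict(lambda: defaultdict(int))  # tag-bigram counts
emit = defaultdict(lambda: defaultdict(int))   # tag -> word counts

for sent in training:
    prev = "<s>"
    for word, t in sent:
        trans[prev][t] += 1
        emit[t][word] += 1
        prev = t

def prob(table, given, outcome):
    total = sum(table[given].values())
    return table[given][outcome] / total if total else 0.0

def tag(words):
    prev, out = "<s>", []
    for w in words:
        # Greedy argmax over tags t of P(t | prev) * P(w | t).
        best = max(emit, key=lambda t: prob(trans, prev, t) * prob(emit, t, w))
        out.append(best)
        prev = best
    return out

print(tag(["the", "cat", "barks"]))  # ['DT', 'NN', 'VBZ']
```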
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2...
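To see where the training signal for such a tokenizer comes from, the sketch below derives character-level token-boundary labels from a Universal Dependencies sentence, using the `# text = ...` comment and the token surface forms of a CoNLL-U file; it ignores multiword tokens for brevity, and the paper's actual training setup may differ:

```python
def boundary_labels(text: str, forms: list[str]) -> list[int]:
    """Label each character of `text` with 1 if it ends a token, else 0."""
    labels = [0] * len(text)
    pos = 0
    for form in forms:
        start = text.index(form, pos)  # locate the token's surface form
        end = start + len(form) - 1
        labels[end] = 1                # last character of the token
        pos = end + 1
    return labels

# '# text' line and FORM column of a toy CoNLL-U sentence:
text = "No, thanks."
forms = ["No", ",", "thanks", "."]
print(list(zip(text, boundary_labels(text, forms))))
# [('N', 0), ('o', 1), (',', 1), (' ', 0), ('t', 0), ('h', 0),
#  ('a', 0), ('n', 0), ('k', 0), ('s', 1), ('.', 1)]
```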