Tokenization and segmentation are performed in the early stages of most text analysis. They are normally fast processes carried out with deterministic algorithms [3] to establish token and sentence boundaries [5]. Inexact tokenization can negatively affect later processing and applications built on the corpus: applying a dependency parser to a badly tokenized sequence, for instance, yields errors beyond the span of the problematic token. Moreover, depending on the input (e.g. language, variety, register) and on the purpose of the study, different tokenization decisions may be expected; multiword expressions and ambiguous separators such as hashtags, for instance, can be approached in different ways. This study explores three widely used tokenizers - T...
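To make the point about ambiguous separators concrete, here is a minimal Python sketch with two invented regex rule sets (not any of the tokenizers compared in the study) that disagree on a hashtag:

```python
import re

def tokenize_a(text: str) -> list[str]:
    # Rule set A: every non-alphanumeric character is its own token,
    # so '#' is split off from the word that follows it.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_b(text: str) -> list[str]:
    # Rule set B: keep '#word' together, as a social-media-aware
    # tokenizer might; otherwise behave like rule set A.
    return re.findall(r"#\w+|\w+|[^\w\s]", text)

text = "New results on #COVID19 out today!"
print(tokenize_a(text))  # ['New', 'results', 'on', '#', 'COVID19', 'out', 'today', '!']
print(tokenize_b(text))  # ['New', 'results', 'on', '#COVID19', 'out', 'today', '!']
```

Neither output is wrong in itself; which one a downstream application needs depends on the language, register, and purpose of the study, which is exactly why such decisions vary across tokenizers.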
Corpus linguistic and language technological research needs empirical corpus data with nearly correc...
Corpora are often referred to as the ‘tools’ of corpus linguistics. However, it is important to reco...
Text mining is the process of extracting interesting and non-trivial knowledge or information from u...
When comparing different tools in the field of natural language processing (NLP), the quality of the...
Current taggers assume that input texts are already tokenized, i.e. correctly segmented into tokens or...
Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokeniz...
Among the most common operations in language processing are segmentation and labelling [7]. Chunki...
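As a concrete instance of the segmentation-and-labelling view, the sketch below decodes chunk spans from BIO labels, the standard encoding for chunking tasks; the function, tokens, and labels are invented for illustration and are not taken from the cited work:

```python
def bio_to_chunks(tokens: list[str], labels: list[str]):
    # Recover (type, span) chunks from BIO labels: 'B-X' begins a chunk
    # of type X, 'I-X' continues it, 'O' is outside any chunk.
    chunks, start, kind = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel flushes the last chunk
        if lab == "O" or lab.startswith("B-"):
            if start is not None:
                chunks.append((kind, tokens[start:i]))
                start, kind = None, None
        if lab.startswith("B-"):
            start, kind = i, lab[2:]
    return chunks

tokens = ["He", "reckons", "the", "current", "account", "deficit"]
labels = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]
print(bio_to_chunks(tokens, labels))
# [('NP', ['He']), ('VP', ['reckons']), ('NP', ['the', 'current', 'account', 'deficit'])]
```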
Tokenization is the process of splitting running texts into minimal meaningful units. In writing sys...
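As one illustration of what "minimal meaningful units" can mean in a whitespace-delimited writing system, here is a toy Python tokenizer that peels off punctuation and splits English clitics in the Penn Treebank style; the rules are invented for the example and are not the scheme of the cited work:

```python
import re

# Penn-Treebank-style clitic endings ("don't" -> "do" + "n't").
CLITIC = re.compile(r"(?i)(\w+)(n't|'re|'ve|'ll|'d|'s|'m)$")

def tokenize(text: str) -> list[str]:
    tokens = []
    for chunk in text.split():
        # Separate leading/trailing punctuation from the word core.
        lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", chunk).groups()
        tokens.extend(lead)
        m = CLITIC.match(core)
        if m:
            tokens.extend(m.groups())  # e.g. "Don't" -> "Do", "n't"
        elif core:
            tokens.append(core)
        tokens.extend(trail)
    return tokens

print(tokenize("Don't stop, she said."))
# ['Do', "n't", 'stop', ',', 'she', 'said', '.']
```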
In natural language processing (NLP), a crucial subsystem in a wide range of ...
What are the units of text that we want to model? From bytes to multi-word expressio...
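One point on that spectrum can be made concrete with byte-pair encoding (BPE), which learns subword units by repeatedly merging the most frequent adjacent symbol pair; the toy corpus and merge count below are invented for illustration:

```python
from collections import Counter

def learn_bpe(words: list[str], n_merges: int):
    # Start from characters; each merge fuses the most frequent
    # adjacent symbol pair across the corpus into one new symbol.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "low", "lower", "lowest"], 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')] -- 'low' becomes a single unit
```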
Statistical n-gram taggers like those of [Church 1988] or [Foster 1991] assign a part-of-speech label...
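The core of such a model fits in a few lines. The sketch below estimates tag-bigram and word-emission probabilities by relative frequency and tags greedily left to right; real systems decode the whole sequence (e.g. with dynamic programming), and the toy corpus here is invented:

```python
from collections import defaultdict

training = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
            [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

trans = defaultdict(lambda: defaultdict(int))  # tag-bigram counts
emit = defaultdict(lambda: defaultdict(int))   # tag -> word counts

for sent in training:
    prev = "<s>"
    for word, t in sent:
        trans[prev][t] += 1
        emit[t][word] += 1
        prev = t

def prob(table, given, outcome):
    total = sum(table[given].values())
    return table[given][outcome] / total if total else 0.0

def tag(words):
    prev, out = "<s>", []
    for w in words:
        # Greedy argmax over tags t of P(t | prev) * P(w | t).
        best = max(emit, key=lambda t: prob(trans, prev, t) * prob(emit, t, w))
        out.append(best)
        prev = best
    return out

print(tag(["the", "cat", "barks"]))  # ['DT', 'NN', 'VBZ']
```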
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2...
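To see where the training signal for such a tokenizer comes from, the sketch below derives character-level token-boundary labels from a Universal Dependencies sentence, using the `# text = ...` comment and the token surface forms of a CoNLL-U file; it ignores multiword tokens for brevity, and the paper's actual training setup may differ:

```python
def boundary_labels(text: str, forms: list[str]) -> list[int]:
    """Label each character of `text` with 1 if it ends a token, else 0."""
    labels = [0] * len(text)
    pos = 0
    for form in forms:
        start = text.index(form, pos)  # locate the token's surface form
        end = start + len(form) - 1
        labels[end] = 1                # last character of the token
        pos = end + 1
    return labels

# '# text' line and FORM column of a toy CoNLL-U sentence:
text = "No, thanks."
forms = ["No", ",", "thanks", "."]
print(list(zip(text, boundary_labels(text, forms))))
# [('N', 0), ('o', 1), (',', 1), (' ', 0), ('t', 0), ('h', 0),
#  ('a', 0), ('n', 0), ('k', 0), ('s', 1), ('.', 1)]
```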