The tokenizer covers all languages that use the Latin-1, Latin-2, Latin-3 and Cyrillic tables of Unicode, and can be extended to cover other Unicode tables if necessary. It is implemented as a cascaded regular grammar in the CLaRK system and recognizes over 60 token categories. It is easily adapted to new token categories.
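Below is a minimal Python sketch of the cascaded regular-expression approach described above. The category names, patterns, and Unicode ranges are illustrative assumptions, not CLaRK's actual grammar, which distinguishes over 60 categories.

```python
import re

# Illustrative token categories, tried in order; a small sample only,
# not the real rule set.
TOKEN_RULES = [
    ("NUMBER",        re.compile(r"\d+(?:[.,]\d+)*")),
    ("LATIN_WORD",    re.compile(r"[A-Za-z\u00C0-\u024F]+")),  # roughly Latin-1/2/3
    ("CYRILLIC_WORD", re.compile(r"[\u0400-\u04FF]+")),        # Cyrillic table
    ("PUNCT",         re.compile(r"[.,;:!?\"'()\[\]-]")),
    ("SPACE",         re.compile(r"\s+")),
]

def tokenize(text):
    """Scan left to right, emitting (category, token) pairs."""
    pos = 0
    while pos < len(text):
        for category, pattern in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m:
                yield category, m.group()
                pos = m.end()
                break
        else:
            yield "OTHER", text[pos]  # unclassified character: emit as-is
            pos += 1

print(list(tokenize("Токенът e 42.")))
```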
A tokeniser for the Maltese language. The tokeniser accepts UTF-8 text and produces UTF-8 text, so can...
Tokenization and segmentation are steps performed in the earlier stages of most text analysis. It is...
Tokenization is the process of splitting running texts into minimal meaningful units. In writing sys...
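As a brief illustration of why "minimal meaningful units" means more than whitespace splitting (illustrative Python, not tied to any of the tools listed here):

```python
import re

text = "Don't split naively, e.g. on spaces."

# Naive whitespace splitting leaves punctuation attached to words:
print(text.split())
# ["Don't", 'split', 'naively,', 'e.g.', 'on', 'spaces.']

# A character-class split separates punctuation into its own units,
# though abbreviations like "e.g." show why real tokenizers need more
# than a single regex:
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))
# ["Don't", 'split', 'naively', ',', 'e', '.', 'g', '.', 'on', 'spaces', '.']
```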
It uses a morphological lexicon of Bulgarian (100 000 lemmas) compiled as a finite-state automaton ...
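A hedged sketch of the lexicon-as-automaton idea: a character trie is the simplest deterministic finite-state acceptor, shown here with toy lemmas; the actual Bulgarian lexicon is a compiled automaton over roughly 100 000 lemmas.

```python
def build_trie(lemmas):
    """Build a character trie, a simple deterministic finite-state acceptor."""
    root = {}
    for lemma in lemmas:
        node = root
        for ch in lemma:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker = accepting state
    return root

def accepts(trie, word):
    """Follow the transitions for each character; accept only in a final state."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

lexicon = build_trie(["котка", "котки", "куче"])  # toy Bulgarian lemmas
print(accepts(lexicon, "котка"), accepts(lexicon, "кот"))  # True False
```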
The Trainable Tokenizer can tokenize and segment most languages based on supplied configuration a...
Tokenize source code into integer vectors, symbols, or discrete tokens. The following languages are...
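As a generic sketch of the "integer vector" encoding mentioned above (this is not that tool's actual API; it uses Python's standard tokenize module and a vocabulary that grows on first sight):

```python
import io
import tokenize

def to_int_vector(source, vocab):
    """Tokenize Python source and map each token string to an integer id,
    adding unseen tokens to the vocabulary as they appear."""
    ids = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Skip layout tokens; keep only content-bearing ones.
        if tok.type in (tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER):
            continue
        ids.append(vocab.setdefault(tok.string, len(vocab)))
    return ids

vocab = {}
print(to_int_vector("x = x + 1", vocab))  # [0, 1, 0, 2, 3]
print(vocab)                              # {'x': 0, '=': 1, '+': 2, '1': 3}
```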
When comparing different tools in the field of natural language processing (NLP), the quality of the...
Written, synchronic, general, manually annotated, 1 000 000 tokens divided into three sets: 215 000 to...
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2...
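A sketch of where such training data could come from: CoNLL-U files pair each raw sentence (`# text = ...`) with its gold tokens, which is exactly the supervision a trainable tokenizer needs. The file name in the usage comment is an example path, and multiword-token handling is simplified.

```python
def read_conllu(path):
    """Yield (raw_sentence, surface_tokens) pairs from a CoNLL-U file.
    Such pairs give a trainable tokenizer its gold segmentation."""
    text, tokens, skip_until = None, [], 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("# text = "):
                text = line[len("# text = "):]
            elif line and not line.startswith("#"):
                tok_id, form = line.split("\t")[:2]
                if "-" in tok_id:                  # multiword token: keep surface form
                    tokens.append(form)
                    skip_until = int(tok_id.split("-")[1])
                elif "." not in tok_id and int(tok_id) > skip_until:
                    tokens.append(form)            # ordinary token
            elif not line and text is not None:    # blank line ends the sentence
                yield text, tokens
                text, tokens, skip_until = None, [], 0

# Example usage with a hypothetical UD file path:
# for text, toks in read_conllu("bg_btb-ud-train.conllu"):
#     print(text, toks)
```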
The 100 000 most frequent Cyrillic tokens in the BulTreeBank text archive; a UTF-16 list of token-frequenc...
A Python module to tokenise texts in the Alsatian dialects. See the module header for help on how to...
The CLaRK System incorporates several technologies:
- XML technology
- Unicode
- Cascaded Regular Gr...
Tokenization is considered a solved problem when reduced to just word border identification, punctu...