A Python module to tokenise texts in the Alsatian dialects. See the module header for help on how to use the tokeniser. The module requires Python 2.7. This tool was developed in the context of the RESTAURE project, funded by the French ANR. The tokeniser is also described in the following article: https://hal.archives-ouvertes.fr/hal-01539160. Version 1.4.1 fixes a bug that occurred when the space after a comma is missing.
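Since the usage instructions live in the module header rather than here, the following is only a minimal sketch of how such a call might look under Python 2.7. The module name alsatian_tokeniser and the function tokenise are assumptions for illustration, not the confirmed interface; the sample sentence deliberately omits the space after the comma, the case addressed by version 1.4.1.

    # -*- coding: utf-8 -*-
    # Minimal usage sketch (Python 2.7). The module and function names
    # below are hypothetical placeholders; see the module header for
    # the actual API.
    from alsatian_tokeniser import tokenise  # hypothetical import

    # "Sproch,un" has no space after the comma; since version 1.4.1
    # the comma should still be split off as a separate token.
    text = u"S isch e scheni Sproch,un mir rede se gern."
    for token in tokenise(text):
        print token.encode("utf-8")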