A Python module to tokenise texts in the Alsatian dialects. See the module header for help on how to use the tokeniser. The module requires Python 2.7. This tool was developed in the context of the RESTAURE project, funded by the French ANR. The tokeniser is also described in the following article: https://hal.archives-ouvertes.fr/hal-01539160. Version 1.4.1 fixes a bug that occurred when the space after a comma is missing.
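Since the usage instructions live in the module header rather than here, the following is only a minimal sketch of how such a call might look under Python 2.7. The module name alsatian_tokeniser and the function tokenise are assumptions for illustration, not the confirmed interface; the sample sentence deliberately omits the space after the comma, the case addressed by version 1.4.1.

    # -*- coding: utf-8 -*-
    # Minimal usage sketch (Python 2.7). The module and function names
    # below are hypothetical placeholders; see the module header for
    # the actual API.
    from alsatian_tokeniser import tokenise  # hypothetical import

    # "Sproch,un" has no space after the comma; since version 1.4.1
    # the comma should still be split off as a separate token.
    text = u"S isch e scheni Sproch,un mir rede se gern."
    for token in tokenise(text):
        print token.encode("utf-8")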