This paper proposes an architecture, based on statistical machine translation, for developing the text normalization module of a text-to-speech conversion system. The main goal is a language-independent, data-driven text normalization module that is flexible enough to deal with all the situations that arise in this task. The proposed architecture is composed of three main modules: a tokenizer module that splits the input text into a token graph (tokenization), a phrase-based translation module (token translation), and a post-processing module that removes some tokens. This paper presents initial experiments for numbers and abbreviations. The very good results obtained validate the proposed architecture.
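The three-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several simplifying assumptions: the tokenizer emits a flat token list rather than a token graph, the translation step is a greedy longest-match lookup in a small hand-written table rather than a trained phrase-based statistical decoder, and the names `PHRASE_TABLE`, `DROP`, `tokenize`, `translate`, `postprocess`, and `normalize` are all hypothetical.

```python
import re

# Hypothetical hand-written phrase table (source token tuple -> target tokens).
# A real system would learn these pairs and their scores from parallel data.
PHRASE_TABLE = {
    ("Dr", "."): ["Doctor"],
    ("2", "1"): ["twenty", "one"],
    ("3",): ["three"],
    (".",): ["<drop>"],  # mark remaining periods for later removal
}
DROP = "<drop>"  # assumed placeholder token removed in post-processing
MAX_PHRASE_LEN = max(len(src) for src in PHRASE_TABLE)


def tokenize(text):
    """Split text into words, single digits, and punctuation marks."""
    return re.findall(r"[A-Za-z]+|\d|[^\w\s]", text)


def translate(tokens):
    """Greedy longest-match lookup; unknown tokens are copied unchanged.

    This stands in for phrase-based decoding and uses no probabilities.
    """
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            src = tuple(tokens[i:i + n])
            if src in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[src])
                i += n
                break
        else:
            # No phrase matched at this position: pass the token through.
            out.append(tokens[i])
            i += 1
    return out


def postprocess(tokens):
    """Remove placeholder tokens and join the rest into a string."""
    return " ".join(t for t in tokens if t != DROP)


def normalize(text):
    return postprocess(translate(tokenize(text)))


print(normalize("Dr. Smith has 21 cats."))
```

In this sketch the abbreviation and the number are both handled by the same table lookup, which mirrors the idea of treating normalization as translation between token sequences. Splitting numbers into single digits is a choice made here for illustration only.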
We present work in progress aiming to build tools for the normalization of User-Generated Content (U...
The boom of natural language processing (NLP) is taking place in a world where more...
In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluatin...
This paper describes the text normalization module of a text to speech fully-trainable conversion sy...
With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-can...
One of the most persistent characteristics of written user-generated content (UGC) is the use of non...
Text normalization is the task of mapping non-canonical language, typical of speech transcription an...
The creation of text corpora requires a sequence of processing steps in order to constitute, normali...
This paper describes a process of text normalization sy...
This paper describes a text normalization system for deletion-based abbreviations in informal text. ...
Text normalization is the task of mapping noncanonical language, typical of speech transcription and...
Text normalization methods have been commonly applied to historical language or user-generated conte...
The information contained in messages posted on the Internet (forums, social networks, review sites....