This paper proposes an architecture, based on statistical machine translation, for developing the text normalization module of a text-to-speech conversion system. The main goal is a language-independent, data-driven text normalization module that is flexible enough to deal with all the situations that arise in this task. The proposed architecture is composed of three main modules: a tokenizer module that splits the input text into a token graph (tokenization), a phrase-based translation module (token translation), and a post-processing module that removes some tokens. This paper presents initial experiments for numbers and abbreviations. The very good results obtained validate the proposed architecture.
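The three-module flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes several simplifying assumptions: the tokenizer produces a flat token sequence instead of a token graph, the statistical phrase-based decoder is replaced by a greedy longest-match lookup in a tiny hand-written phrase table, and the `<w>` boundary marker, the table entries, and all function names are hypothetical.

```python
import re

# Hypothetical toy phrase table (source token tuple -> target tokens).
# A real phrase-based system would learn such pairs from parallel data.
PHRASE_TABLE = {
    ("dr", "."): ["doctor"],
    ("2", "1"): ["twenty", "one"],
    ("3",): ["three"],
    ("km",): ["kilometers"],
}
MAX_PHRASE_LEN = max(len(key) for key in PHRASE_TABLE)
BOUNDARY = "<w>"  # hypothetical marker token, removed in post-processing


def tokenize(text):
    """Split text into letter runs, single digits and punctuation marks,
    appending a boundary marker after each whitespace-separated word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(re.findall(r"[a-z]+|\d|[^\sa-z\d]", word))
        tokens.append(BOUNDARY)
    return tokens


def translate(tokens):
    """Greedy longest-match phrase lookup (a stand-in for a statistical decoder)."""
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[key])
                i += n
                break
        else:
            out.append(tokens[i])  # unknown tokens pass through unchanged
            i += 1
    return out


def postprocess(tokens):
    """Remove auxiliary marker tokens and join the remaining words."""
    return " ".join(t for t in tokens if t != BOUNDARY)


def normalize(text):
    return postprocess(translate(tokenize(text)))


print(normalize("Dr. Smith ran 21 km"))
```

Splitting numbers into single digits keeps the token vocabulary small, so the translation step only has to map digit sequences to words; a trained phrase-based model would score competing segmentations instead of always taking the longest match.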
Compared to the edited genres that have played a central role in NLP research, microblog texts use ...
The information contained in messages posted on the Internet (forums, social networks, review sites....
This paper presents a system to normalize Spanish tweets, which uses preprocessing rules, a domain-a...
This paper describes the text normalization module of a text to speech fully-trainable conversion sy...
The creation of text corpora requires a sequence of processing steps in order ...
One of the most persistent characteristics of written user-generated content (UGC) is the use of non...
Text normalization is the task of mapping non-canonical language, typical of speech transcription an...
With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-can...
Includes bibliographical references (page 4). This paper describes a process of text normalization sy...
Text normalization is the task of mapping noncanonical language, typical of speech transcription and...
This paper describes a text normalization system for deletion-based abbreviations in informal text. ...
Text normalization methods have been commonly applied to historical language or user-generated conte...
In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluatin...