International audience. Creating a text corpus requires a sequence of processing steps to assemble the corpus, normalize it, and make it directly usable by a given application. This paper presents a generic approach to text normalization and concentrates on the methodology and linguistic engineering needed to develop a multipurpose multilingual text corpus. The approach was applied to French, English, Spanish, Vietnamese, Khmer and Chinese. It consists of splitting the text normalization problem into a set of smaller sub-problems that are as language-independent as possible. A set of text corpus normalization tools with linked resources and a document structuring method are proposed.
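The idea of splitting normalization into small, mostly language-independent sub-problems can be illustrated with a minimal sketch. This is not the paper's toolset: the step functions, the `SHARED_STEPS` and `LANGUAGE_STEPS` tables, and the choice of which steps apply to which language are all assumptions made for illustration.

```python
import re
import unicodedata
from typing import Callable, Dict, List

# A normalization step maps text to text (hypothetical interface).
Step = Callable[[str], str]


def normalize_unicode(text: str) -> str:
    """Language-independent: compose characters into Unicode NFC form."""
    return unicodedata.normalize("NFC", text)


def collapse_whitespace(text: str) -> str:
    """Language-independent: squeeze whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase(text: str) -> str:
    """Language-dependent in this sketch: only applied where configured."""
    return text.lower()


# Steps shared by every language (assumed split, for illustration only).
SHARED_STEPS: List[Step] = [normalize_unicode, collapse_whitespace]

# Extra steps per language code (hypothetical configuration).
LANGUAGE_STEPS: Dict[str, List[Step]] = {
    "fr": [lowercase],
    "en": [lowercase],
}


def normalize(text: str, lang: str) -> str:
    """Run the shared steps, then any steps registered for `lang`."""
    for step in SHARED_STEPS + LANGUAGE_STEPS.get(lang, []):
        text = step(text)
    return text
```

Under these assumptions, supporting a new language means registering only its language-specific steps, while the shared steps are reused unchanged.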
National audience. The boom of natural language processing (NLP) is taking place in a world where more...
Normalization has different meanings in translation studies. It may refer to a process of standardiz...
International audience. Text normalization is a necessity to correct and make more sense of the micro-...
This paper describes the text normalization module of a text to speech fully-trainable conversion sy...
This paper proposes an architecture, based on statistical machine translation, for developing the te...
One of the most persistent characteristics of written user-generated content (UGC) is the use of non...
Text normalization methods have been commonly applied to historical language or user-generated conte...
Lexical normalization is the task of transforming an utterance into its standardized form. This task...
A finalised digital resource of 88,000 anonymised French text messages, the 88milSMS corpus, two ext...
In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluatin...
One of the main objectives of corpus-based translation studies is to describe the differences betwee...
With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-can...