Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization (i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety) as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful seq...
In this paper we describe the construction of a parallel corpus between the standard and a non-stan...
This paper presents a rule-based method for converting between colloquial Finnish and standard Finni...
We propose a language-independent word normalization method exemplified on mod...
This paper evaluates various character alignment methods on the task of sentence-level standardizati...
Swiss dialects of German are, unlike most dialects of well standardised languages, widely used in ev...
Language label tokens are often used in multilingual neural language modeling and sequence-to-sequen...
The goal of this work is to design a machine translation (MT) system for a low-resource family of di...
This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been dialectologic...
There is no consensus on the state-of-the-art approach to historical text normalization. Many techni...
Spoken data from language-contact situations is extremely varied. This heterogeneity makes it diffic...
To study and automatically process Swiss German, it is necessary to resolve the issue of variation i...
The creation of text corpora requires a sequence of processing steps in order to constitute, normali...
Text-to-Speech (TTS) normalization is an essential component of natural language processing (NLP) th...
Text normalization is the task of mapping non-canonical language, typical of speech transcription an...