This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been dialectologically annotated and manually normalized to a standard form. The dataset can be used as a test set for dialect identification and dialect-to-standard normalization, for instance. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus and three newly trained models with different architectures. We find that there are significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset. Peer reviewed.
This article presents the Nordic Tweet Stream (NTS), a cross-disciplinary corpus project of computer ...
The ever-growing usage of social media platforms generates daily vast amounts of textual data which ...
ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standa...
Text normalization methods have been commonly applied to historical language or user-generated conte...
Dialectal data and normalization models presented in the following paper: Hämäläinen, M., Alnajjar,...
The data used in the paper "Finnish Dialect Identification: The Effect of Audio and Text". If you u...
Language label tokens are often used in multilingual neural language modeling and sequence-to-sequen...
Social media provides huge amounts of potential data for natural language processing but using this ...
In this paper we discuss the parallel manual normalisation of samples extracted from Croatian and Se...
We adopt automatic language recognition methods to study dialect levelling, a phenomenon that lead...
Data used in our Swedish normalization paper: Hämäläinen, M; Partanen, N & Alnajjar, K (2020) Norma...
One of the main characteristics of social media data is the use of non-standard language. Since NLP ...
This paper presents a rule-based method for converting between colloquial Finnish and standard Finni...
One of the major challenges in the era of big data use is how to 'clean' the vast amount of data, pa...
One of the major challenges in the era of big data use is how to ‘clean’ the vast amount of data, pa...