We present herein our work on language identification applied to comments left by the readers of online news sites popular in Kazakhstan. Such comments are typically written in one of the two languages widely spoken in the area (Kazakh and Russian) and sometimes in a mixture of both. Code-switching (mixing languages) makes it desirable to identify language not only at the document level but also at the individual word level. We approach both tasks in a single two-step framework, performing unsupervised normalization and Naive Bayes text classification successively. Moreover, we applied a deep learning model based on recurrent networks with LSTM cells to classify text. Our results suggest improvement over the state-of-the-art for Kazakh...
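As a rough illustration of the classification step only, the following sketch trains a Naive Bayes classifier and applies it at both the word level and the document level. It is not the authors' implementation: the character uni- and bigram features, the Laplace smoothing constant, the class and function names, and the tiny training lists are all assumptions made for this example, and the unsupervised normalization step and the LSTM model are omitted.

```python
import math
from collections import Counter


def char_ngrams(word, n_max=2):
    """Character 1..n_max-grams of a lowercased word padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NaiveBayesLID:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha      # Laplace smoothing constant (assumed value)
        self.counts = {}        # label -> Counter of n-gram counts
        self.totals = {}        # label -> total number of n-grams seen
        self.log_priors = {}    # label -> log P(label)
        self.vocab = set()      # all n-grams seen in training

    def fit(self, samples):
        """samples: list of (word, label) pairs."""
        label_freq = Counter(label for _, label in samples)
        for word, label in samples:
            grams = char_ngrams(word)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        for label, counter in self.counts.items():
            self.totals[label] = sum(counter.values())
            self.log_priors[label] = math.log(label_freq[label] / len(samples))
        return self

    def log_likelihood(self, word, label):
        """Smoothed log P(n-grams of word | label)."""
        denom = self.totals[label] + self.alpha * len(self.vocab)
        return sum(math.log((self.counts[label][g] + self.alpha) / denom)
                   for g in char_ngrams(word))

    def classify_word(self, word):
        return max(self.counts,
                   key=lambda lab: self.log_priors[lab]
                   + self.log_likelihood(word, lab))

    def classify_document(self, text):
        """Treat the document as a bag of words; apply the prior once."""
        words = text.split()
        return max(self.counts,
                   key=lambda lab: self.log_priors[lab]
                   + sum(self.log_likelihood(w, lab) for w in words))

    def tag_words(self, text):
        """Word-level labels, useful for code-switched comments."""
        return [(w, self.classify_word(w)) for w in text.split()]


# Hypothetical toy training data; a real system would need far more text.
TRAIN = (
    [(w, "kk") for w in ["қазақ", "тілі", "және", "үшін", "бұл", "қала", "жаңалық"]]
    + [(w, "ru") for w in ["русский", "язык", "это", "город", "для", "что", "новости"]]
)

model = NaiveBayesLID().fit(TRAIN)

if __name__ == "__main__":
    print(model.classify_word("қазақстан"))
    print(model.classify_document("это новости для город"))
    print(model.tag_words("бұл город"))
```

Scoring each word separately is what allows a single mixed comment to receive different labels for different words, while the document-level decision sums the same per-word log-likelihoods under one label.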
Speech recognition is a rapidly growing field in machine learning. Conventional automatic speech rec...
Code-mixing or language-mixing is a linguistic phenomenon where multiple languages mix together durin...
Speech synthesis, or text-to-speech (TTS), has made significant progress in recent years, with deep ...
We address the problem of normalizing user generated content in a multilingual setting. Specifically...
Automatic language identification (LID) belongs to the automatic process whereby the identity of the...
The thesis explores the status quo of the Kazakh language in terms of corpus linguistics. The proj...
Automatically analyzing and extracting useful information from noisy social media content are curren...
Research in the field of semantic text analysis begins with the study of the structure of natural la...
The world is growing more connected through the use of online communication, exposing software and h...
In social media communication, multilingual speakers often switch between languages, and, in such ...
Uyghur is the second largest and most actively used social media language in China. However, a non-n...
This paper extends the work of Cavnar and Trenkle N-gram text categorization [2], enhances the study...
We present a statistical approach to text-based automatic language identification that focuses on di...
This article describes the methods of creating a system of recognizing the continuous speech of Kaza...
Recently, image-based text extraction has become a prominent and hard study subject in computer visio...