The classification accuracy of text-based language identification depends on several factors, including the size of the text fragment to be identified, the amount of training data available, the classification features and algorithm employed, and the similarity of the languages to be identified. To date, no systematic study of these factors and their interactions has been published. We therefore investigate the effects of each of these factors and their relations on the performance of text-based language identification. Our study uses n-gram statistics as features for classification. In particular, we compare support vector machines, Naïve Bayesian and difference-in-frequency classifiers on different amounts of input text and various values...
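As a rough illustration of the kind of setup the abstract describes, the following is a minimal sketch of a character n-gram Naive Bayes language identifier. It is not the authors' implementation: the names `char_ngrams` and `NgramNaiveBayes`, the trigram default, the add-one smoothing, the uniform language prior, and the two toy training sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (hypothetical helper)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n              # n-gram length (assumed default: trigrams)
        self.alpha = alpha      # additive smoothing constant (assumption)
        self.counts = {}        # language -> Counter of n-gram frequencies
        self.totals = {}        # language -> total number of n-grams seen
        self.vocab = set()      # all n-grams seen in any language

    def fit(self, samples):
        """Accumulate n-gram counts from (text, language) pairs."""
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(lang, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}
        return self

    def predict(self, text):
        """Return the language with the highest smoothed log-likelihood.

        A uniform prior over languages is assumed, so the prior term is omitted.
        """
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)

        def log_likelihood(lang):
            counts = self.counts[lang]
            denom = self.totals[lang] + self.alpha * vocab_size
            return sum(math.log((counts[g] + self.alpha) / denom) for g in grams)

        return max(self.counts, key=log_likelihood)


# Toy training data, invented for this sketch only.
train = [
    ("the quick brown fox jumps over the lazy dog and the cat", "en"),
    ("der schnelle braune fuchs springt über den faulen hund und die katze", "de"),
]
classifier = NgramNaiveBayes(n=3).fit(train)
print(classifier.predict("the dog and the fox"))
print(classifier.predict("der hund und die katze"))
```

Varying `n`, the length of the text passed to `predict`, and the number of training samples gives a small-scale way to probe the factors listed above, although a toy corpus like this one says nothing about real accuracy.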
This paper extends the work of Cavnar and Trenkle on N-gram text categorization [2], enhances the study...
Text classification via supervised learning involves various steps from processing raw data, featur...
Language identification is widely used in machine learning, text mining, information retrieval, and ...
We investigate the performance of text-based language identification systems on the 11 official lan...
Language identification of written text has been studied for several decades. Despite this fact, mos...
We present a statistical approach to text-based automatic language identification that focuses on di...
In this paper, we explore the use of Support Vector Machines (SVMs) to learn a discriminatively ...
Language identification is an important pre-process in many data management and information retrieva...
Language identification is a text classification task for identifying the language of a given text. ...
Text on the Internet is written in different languages and scripts that can be divided into differen...
In this paper we present two experiments conducted for comparison of different language identificati...
This paper describes three approaches to the task of automatically identifying the language a text i...
Language Identification is the process of determining in which natural language the content...
Automatic language identification (LID) decisions are made based on scores of language models (LM). ...
This thesis has taken a closer look at the implementation of the back-end of a language recognition ...