Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on character n-gram models to identify the language of a text, with good results reported for European languages. In this paper, we propose the use of a character n-gram model and a word n-gram language model for the automatic classification of two written varieties of Portuguese: European and Brazilian. Results reached an accuracy of 0.998 using character 4-grams.
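To make the idea of a character n-gram classifier concrete, the following is a minimal sketch, not the system evaluated in the paper. It assumes add-one (Laplace) smoothing, lowercasing, and space padding, none of which are stated in the abstract. The class name `CharNgramClassifier`, the labels `pt` and `br`, and the four toy training sentences are hypothetical and purely illustrative.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramClassifier:
    """Hypothetical per-variety character n-gram model with add-one smoothing."""

    def __init__(self, n=4):
        self.n = n
        self.counts = {}    # label -> Counter of n-gram frequencies
        self.totals = {}    # label -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in any variety

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {label: sum(c.values()) for label, c in self.counts.items()}
        return self

    def log_prob(self, text, label):
        counts = self.counts[label]
        # Add-one smoothing; the extra +1 reserves mass for unseen n-grams.
        denom = self.totals[label] + len(self.vocab) + 1
        return sum(math.log((counts[g] + 1) / denom)
                   for g in char_ngrams(text, self.n))

    def predict(self, text):
        # Choose the variety whose model assigns the highest log-probability.
        return max(self.counts, key=lambda label: self.log_prob(text, label))


# Toy training data (illustrative only, far too small for real use).
train = [
    ("o autocarro chegou à estação de comboios", "pt"),
    ("estou a ver o telemóvel", "pt"),
    ("o ônibus chegou na estação de trem", "br"),
    ("estou vendo o celular", "br"),
]
clf = CharNgramClassifier(n=4).fit(train)
print(clf.predict("o comboio e o autocarro"))
print(clf.predict("o trem e o ônibus"))
```

With realistic corpora, the same structure applies to word n-grams by replacing `char_ngrams` with a function that yields word sequences; the choice of smoothing and of n would need to be tuned on held-out data.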
This paper extends the work of Cavnar and Trenkle N-gram text categorization [2], enhances the study...
Using standard methods and formats established at LADL, and adopted by several European research tea...
The ISOcat Data Category Registry (www.isocat.org) has been developed by ISO TC 37 and CLARIN to sha...
Language identification is an important first step in many IR and NLP applications. Most publicly av...
This paper describes two automatic systems: a linguistic features extractor and a text readability c...
In this paper we describe the most recent work within ISO TC37/SC 4, and in particular the developme...
Language Identification is the process of determining in which natural language the content...
To achieve true interoperability for valuable linguistic resources different levels of variation nee...
The application, developed in C#, automatically identifies the language of a text written in one of ...
We present a statistical approach to text-based automatic language identification that focuses on di...