Abstract. This paper explores the use of linguistic information for the selection of data to train language models. We take the state-of-the-art method in perplexity-based data selection as our starting point and extend it to use word-level linguistic units (i.e. lemmas, named entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of linguistic knowledge with the surface forms: (1) naïve selection of the top-ranked sentences selected by each method, and (2) linear interpolation of the datasets selected by the different methods. The paper presents detailed results and analysis for four languages with different levels of morphological complexity (English, Spanish, Czech and Chi...
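The selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so this example assumes a cross-entropy difference between an in-domain and a general language model, a common form of perplexity-based selection. It also swaps in add-one smoothed unigram models for brevity where a real system would use higher-order n-gram models. The function names and the `represent` hook (a hypothetical way to replace surface forms with lemmas or tags before scoring) are illustrative and not taken from the paper.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total token count) for a list of token lists."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model.

    Assumes the sentence is non-empty.
    """
    counts, total = model
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in sentence
    )
    return -log_prob / len(sentence)


def rank_by_ce_difference(candidates, in_domain, general, represent=lambda s: s):
    """Rank candidates by H_in(s) - H_gen(s); lower means closer to the domain.

    `represent` is a hypothetical hook mapping a token list to the units that
    are scored (surface forms by default; lemmas or tags if a mapping is given).
    """
    in_sents = [represent(s) for s in in_domain]
    gen_sents = [represent(s) for s in general]
    vocab = {tok for s in in_sents + gen_sents for tok in s}
    vocab_size = len(vocab) + 1  # one extra slot for unseen units
    in_model = train_unigram(in_sents)
    gen_model = train_unigram(gen_sents)
    scored = []
    for sent in candidates:
        units = represent(sent)
        score = (cross_entropy(units, in_model, vocab_size)
                 - cross_entropy(units, gen_model, vocab_size))
        scored.append((score, sent))
    # Sort on the score only, so ties keep their original order.
    return [sent for _, sent in sorted(scored, key=lambda pair: pair[0])]


# Toy data, invented purely for illustration.
in_domain = [["the", "patient", "received", "treatment"],
             ["the", "doctor", "examined", "the", "patient"]]
general = [["the", "team", "won", "the", "match"],
           ["the", "market", "fell", "today"]]
candidates = [["the", "team", "won"],
              ["the", "patient", "received", "treatment"]]

ranked = rank_by_ce_difference(candidates, in_domain, general)
```

Passing a different `represent` function, for example one that maps each token to its lemma through a lookup table, would score the same sentences over lemmas instead of surface forms while still returning the original sentences in ranked order.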
This study shows that using computational linguistic models is beneficial for descriptive linguistic...
Statistical language models are widely used in automatic speech recognition in order to constrain th...
How cross-linguistically applicable are NLP models, specifically language models? A fair comparison ...
Peer-reviewed international conference with proceedings. In this paper, we pro...
Data selection is an effective approach to domain adaptation in statistical machine translation. Th...
260 pages. The majority of work at the intersection of computational linguistics and natural language ...
In this work, we study the effect of the training set on statistical language modeling (SLM). ...
Thesis (Ph.D.)--University of Washington, 2014. Machine translation, the computerized translation of o...
Statistical language modeling remains a challenging task, in particular for morphologically rich lan...
The increasing availability of large digital corpora of cross-linguistic data is revolutionizing man...
Abstract. In this paper, we study selection criteria for the use of word trigger pairs in statistica...
This paper presents a thorough study of the impact of morphology derivation on N-gram-based Statisti...
Abstract. In this paper, we present a language model based on clusters obtained by applying regular ...
Data selection has shown significant improvements in effective use of training data by extracting se...
Data Selection has emerged as a common issue in language technologies. We define Data Selection as t...