We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to web search engines. The search engines return the number of web pages containing the phrase, from which the N-gram count is estimated. The N-gram counts are then used to form web-based trigram probability estimates. We discuss the properties of such estimates, and methods to interpolate them with traditional corpus-based trigram estimates. We show that the interpolated models significantly reduce speech recognition word error rate on a small test set.
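As a rough illustration of the pipeline described in the abstract, the following minimal sketch turns phrase hit counts into a trigram estimate and mixes it with a corpus-based estimate. Several things here are assumptions rather than details from the paper: the hit counts are supplied as a plain dictionary instead of coming from a real search engine, the estimate is a simple relative frequency, and the combination is a fixed-weight linear interpolation. The names `web_trigram_prob`, `interpolate`, and `lam` are hypothetical.

```python
from typing import Dict, Optional, Tuple

Ngram = Tuple[str, ...]


def web_trigram_prob(hits: Dict[Ngram, int], w1: str, w2: str, w3: str) -> Optional[float]:
    """Relative-frequency estimate hits(w1 w2 w3) / hits(w1 w2).

    `hits` stands in for page counts returned by phrase queries (an assumption:
    no real search API is called here). Returns None when the bigram history
    was never seen, so the caller can fall back to the corpus model.
    """
    history = hits.get((w1, w2), 0)
    if history == 0:
        return None
    trigram = hits.get((w1, w2, w3), 0)
    # Page counts are noisy estimates, so clamp the ratio to a valid probability.
    return min(trigram / history, 1.0)


def interpolate(p_corpus: float, p_web: Optional[float], lam: float) -> float:
    """Linear interpolation lam * p_web + (1 - lam) * p_corpus.

    Fixed-weight linear interpolation is an assumed choice for this sketch;
    the abstract only says that interpolation methods are discussed.
    """
    if p_web is None:
        return p_corpus
    return lam * p_web + (1.0 - lam) * p_corpus


# Made-up hit counts, for illustration only.
example_hits = {("new", "york"): 1000, ("new", "york", "city"): 250}
p_web = web_trigram_prob(example_hits, "new", "york", "city")
p_mixed = interpolate(p_corpus=0.1, p_web=p_web, lam=0.5)
```

Returning `None` for an unseen history keeps the fallback to the corpus-based estimate explicit, instead of silently assigning zero probability.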
The World Wide Web is the greatest information space unseen until now, distrib...
n-gram language modeling is a popular technique used to improve performance of various NLP applicati...
Training language model made from conversational speech is difficult due to large variation of the w...
In this paper several methods are proposed for reducing the size of a trigram language model (LM), w...
This PhD thesis studies the overall effect of statistical language modeling on perplexity and word e...
Computational approaches in language identification often result in high number of false positives a...
This paper appears in Proceedings of the Third International Workshop on Parsing Technologies, 1993.
Abstract. The 60-year-old dream of computational linguistics is to make computers capable of communi...
This article describes a methodology for collecting text from the Web to match a target sublanguage ...
This paper describes an extension of the n-gram language model: the similar n-...
This paper deals with the combination of a trigram and a triclass. This combin...
In domains with insufficient matched training data, language models are often constructed by interpo...
In a series of preparatory experiments in 4 languages on subsets of the Europa...
ICSLP1998: the 5th International Conference on Spoken Language Processing, November 30 - December 4...