We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed, with emphasis on the problems that arise in working with data at this scale. Our unpruned Kneser-Ney English 5-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams. We show gains of 0.5–1.4 BLEU by using large language models to translate into various languages.
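The two preprocessing steps the abstract highlights, deduplication of boilerplate and counting 5-grams while keeping singletons, can be illustrated with a minimal sketch. This is not the paper's actual pipeline (which operates at web scale); the function names and the line-level exact-match deduplication are simplifying assumptions for illustration.

```python
from collections import Counter

def dedup_lines(lines):
    """Drop exact duplicate lines (a stand-in for the paper's removal
    of repeated boilerplate such as navigation text)."""
    seen = set()
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            yield key

def ngram_counts(lines, n=5):
    """Count n-grams over deduplicated lines. Singletons (count == 1)
    are kept, since Kneser-Ney smoothing needs low-count statistics
    to estimate its discounts."""
    counts = Counter()
    for line in dedup_lines(lines):
        tokens = ["<s>"] + line.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",   # duplicate boilerplate, removed
    "a dog sat on the mat",
]
counts = ngram_counts(corpus, n=5)
```

After deduplication the repeated line contributes nothing, so most 5-grams remain singletons, exactly the entries that pruned releases such as the Google n-grams discard.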