In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness of a collection of documents (corpus), with respect to a number of biased partitions. The method is based on the comparison of the word frequency distribution of the target corpus to word frequency distributions from corpora built in deliberately biased ways. We apply the method to the task of building a corpus via queries to Google. Our results indicate that this approach can be used, reliably, to discriminate biased and unbiased document collections and to choose the most appropriate query terms.
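The core comparison of word-frequency distributions can be sketched as a divergence between unigram models of the target corpus and a deliberately biased reference corpus. The choice of KL divergence here is an illustrative assumption; the abstract does not name a specific measure:

```python
from collections import Counter
import math

def word_dist(tokens):
    """Unigram relative-frequency distribution of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q), with epsilon smoothing for words absent from q.

    The smoothing scheme is an assumption made to keep the sketch
    self-contained; the paper may handle unseen words differently.
    """
    return sum(pw * math.log(pw / q.get(w, epsilon)) for w, pw in p.items())

# Toy corpora (hypothetical): a larger divergence from a biased
# reference suggests the target corpus is less like that bias.
target = "the cat sat on the mat the dog ran".split()
biased = "stock market stock prices market crash the".split()
print(kl_divergence(word_dist(target), word_dist(biased)))
```

In practice one would rank several candidate corpora by their divergence from each biased partition, which is the knowledge-poor signal the abstract describes.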
The World Wide Web has become an important knowledge source for many research fields, and quality of We...
A central idea of Language Models is that documents (and perhaps queries) are random variables, gene...
Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenie...
The Web is a very rich source of linguistic data, and in the last few years it has been used very in...
We consider the problem of estimating the size of a collection of documents using only a standard q...
The quality of statistical measurements on corpora is strongly related to a strict definition of the...
Document fields, such as the title or the headings of a document, offer a way to consider the struct...
This paper describes a method for asking statistical questions about a large text corpus. The author...
In this paper, we examine notions of text quality in the context of web corpus...
Corpora, large bodies of text, are of great importance to the field of Natural Language Processing. ...
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been...
The 60-year-old dream of computational linguistics is to make computers capable of communi...
Discovery Science: 4th International Conference, DS 2001, Washington, DC, USA, November 25-28, 2001...
Large digital text samples are promising sources for text-analytical research in the social sciences...