Abstract. The lack of a large-scale Chinese test collection is an obstacle to the development of Chinese information retrieval. To address this issue, we built such a collection from millions of Chinese web pages: the Chinese Web Test collection, 100 gigabytes in data volume (CWT100g). It is the largest Chinese web test collection as of this writing, has been used by several dozen research groups, and was adopted in the evaluation of the SEWM-2004 Chinese Web Track [1] and the HTRDPE-2004 [2]. We present a complete solution for constructing a large-scale test collection like the CWT100g. Further, we found that: 1) the distribution of the number of pages within sites obeys a Zipf-like law instead of a power ...
During the development of large language models (LLMs), the scale and quality of the pre-training da...
Competition for consumers to visit company websites has intensified in recent years. An important in...
Past research into text retrieval methods for the Web has been restricted by the lack of a test coll...
As the amount of information on the Web and the number of inexperienced new users are growing rapidl...
In this paper, the user log of Tianwang, a large-scale distributed Chinese search engine system, is ...
To improve the precision of search engine and locate user-interesting Web page promptly, an investig...
users have been increasing tremendously during the past decade. Since Chinese language is significan...
Abstract. Web filtering is an inductive process which automatically builds a filter by learning the...
The biggest information system of World Wide Web indexing is critical to estimate. Web is the benefi...
The study examined search engine coverage of websites across countries and domains. Websites in four...
Chinese information search engines always encounter a difficulty in segmentation of Chinese words fr...
Internet censorship measurements rely on lists of websites to be tested, or “block lists” that are c...
Various methods have been proposed for creating and maintaining lists of potentially filtered URLs t...