With the growth of web data, estimating web page quality effectively and rapidly is becoming increasingly important for web information retrieval and knowledge discovery. This paper analyzes the differences between retrieval target pages and ordinary pages using query-independent features. Based on these features, we propose an algorithm called Linear Page Estimation (LPE) for web page quality estimation. In experiments on the .GOV and SOGOU corpora, involving 26 million pages, our algorithm removes about 95% of pages while retaining more than 90% of retrieval target pages. Experimental results on TREC datasets also show that retrieval performance on collections selected by our algorithm can be close to or even better tha...
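The abstract does not give the LPE formula, so the following is only a minimal sketch of the general idea it describes: score each page with a linear combination of query-independent features, keep pages above a threshold, and then measure how much of the corpus was removed and how many retrieval target pages survived. The feature names (`in_links`, `doc_length`, `url_depth`), the weights, and the threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Page:
    url: str
    features: Dict[str, float]  # query-independent features (hypothetical names)
    is_target: bool = False     # True if the page is a known retrieval target


# Hypothetical weights; the paper's actual features and weights are not given here.
WEIGHTS = {"in_links": 0.5, "doc_length": 0.3, "url_depth": -0.2}


def linear_score(page: Page, weights: Dict[str, float] = WEIGHTS, bias: float = 0.0) -> float:
    """Linear combination of feature values; missing features count as 0."""
    return bias + sum(w * page.features.get(name, 0.0) for name, w in weights.items())


def select_pages(pages: List[Page], threshold: float) -> List[Page]:
    """Keep only the pages whose linear score reaches the threshold."""
    return [p for p in pages if linear_score(p) >= threshold]


def reduction_and_retention(pages: List[Page], kept: List[Page]) -> Tuple[float, float]:
    """Return (fraction of pages removed, fraction of target pages retained)."""
    reduced = 1 - len(kept) / len(pages)
    targets = sum(p.is_target for p in pages)
    kept_targets = sum(p.is_target for p in kept)
    retained = kept_targets / targets if targets else 0.0
    return reduced, retained


# Toy corpus with made-up, pre-normalized feature values.
corpus = [
    Page("a", {"in_links": 1.0, "doc_length": 1.0, "url_depth": 0.0}, is_target=True),
    Page("b", {"in_links": 0.0, "doc_length": 0.5, "url_depth": 1.0}),
    Page("c", {"in_links": 0.2, "doc_length": 0.2, "url_depth": 0.5}),
    Page("d", {"in_links": 0.0, "doc_length": 0.0, "url_depth": 1.0}),
]
kept = select_pages(corpus, threshold=0.5)
reduced, retained = reduction_and_retention(corpus, kept)
```

In this sketch the threshold controls the trade-off the abstract reports: raising it removes more pages but risks dropping retrieval targets, so in practice it would be tuned against a labeled set of target pages.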
In this paper, we present a framework for evaluating segmentation algorithms f...
A commercial Web page typically contains many information blocks. Apart from the main content block...
The World Wide Web (WWW) is a repository of a large number of web pages that can be accessed via In...
Quality information retrieval for the World Wide Web. The World Wide Web is an unregulated communicat...
The World Wide Web and search engines are widely used, and getting good results from searches is imp...
We report on a study that was undertaken to better understand what kinds of Web pages are the most u...
Many existing retrieval approaches do not take into account the content quality of the retrieved doc...
Recent research has studied how to measure the size of a search engine, in terms of the number of pa...
In this paper, an approach for the implementation of a quality-based Web search engine is proposed. ...
Understanding what kinds of Web pages are the most useful for Web search engine users is a critical ...
The World Wide Web is an unregulated communication medium which exhibits very limited means of quali...
Currently, search engines rank search results using mainly link-based metrics. While usually most of ...
While users can readily find information from the immense store of knowledge in the Web with the hel...