The presence of spam in a document ranking is a major issue for Web search engines. Common approaches that cope with spam remove from the document rankings those pages that are likely to contain spam. These approaches are implemented as post-retrieval processes that filter out spam pages only after documents have been retrieved with respect to a user’s query. In this paper we propose removing spam pages at indexing time, thereby obtaining a pruned index that is virtually “spam-free”. We investigate the benefits of this approach from three points of view: indexing time, index size, and retrieval performance. Not surprisingly, we found that the strategy decreases both the time required by the indexing process and the space required for s...
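The idea of pruning spam at indexing time can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `build_index` and `posting_count` helpers, the `spam_score` callable, the threshold values, and the toy documents are all assumptions made for illustration. The sketch builds a small inverted index and skips any document whose (hypothetical) spam score reaches the threshold, so its postings are never written.

```python
from collections import defaultdict


def build_index(docs, spam_score, threshold=0.5):
    """Build an inverted index, skipping documents flagged as likely spam.

    docs:       dict mapping doc_id -> text
    spam_score: callable doc_id -> float (hypothetical classifier output)
    threshold:  documents scoring at or above this value are not indexed
    """
    index = defaultdict(dict)  # term -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        if spam_score(doc_id) >= threshold:
            continue  # pruned at indexing time: no postings are created
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def posting_count(index):
    """Total number of (term, document) postings, a rough proxy for index size."""
    return sum(len(postings) for postings in index.values())


# Toy collection and made-up spam scores (assumptions, not real data).
docs = {
    "d1": "cheap pills cheap pills buy now",
    "d2": "inverted index compression for web search",
    "d3": "web search evaluation with spam filtering",
}
scores = {"d1": 0.9, "d2": 0.1, "d3": 0.2}

full_index = build_index(docs, scores.get, threshold=1.1)    # nothing is pruned
pruned_index = build_index(docs, scores.get, threshold=0.5)  # d1 is dropped

print(posting_count(full_index), posting_count(pruned_index))
```

In this toy setting the pruned index simply contains fewer postings than the full one; how such pruning affects retrieval performance is the empirical question the abstract raises and is not something the sketch can answer.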
The increasing importance of search engines to commercial web sites has given rise to a phenomenon w...
In this paper, we study the classification of web spam. Web spam refers to pages that use techniques...
Large web search engines process billions of queries each day over tens of billions of documents wit...
Past research in Adversarial Information Retrieval (AIR) has thoroughly addressed the detection of w...
The Web search engines maintain large-scale inverted indexes which are queried thousands of times pe...
Web spam potentially causes three deleterious effects: unnecessary work for crawlers and se...
Carterette, Ben. Static index pruning methods have been proposed to reduce the index size of informati...
Meaningful evaluation of web search must take account of spam. Here we conduct a user experiment to ...
Information retrieval is the process of finding relevant information in large corpora of documents b...
We propose a ranking algorithm to help search engines eliminate spam pages. On the basis of a...
Static index pruning techniques permanently remove a presumably redundant part of an inverted file, ...
High ranking of a Web site in search engines can be directly correlated to high revenues. This ampli...
Spam comprises at least 60% of the public web, and search engine companies invest considerable effor...