Traditional document classification frameworks, which apply a learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, such as the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, so that our access to documents is solely through the inverted index of a large-scale search engine. Our main goal is to build the "best" short query that characterizes a document class using operators normally available within large search engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. Moreover, we show that optimizi...
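The idea of classifying through the index alone can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy corpus, the function names, and in particular the term-scoring rule (difference in document frequency inside and outside the class), which is a simple stand-in and not necessarily the selection method used in the paper. The sketch builds an inverted index, picks a short list of class-characterizing terms, and then "classifies" by running that list as a disjunctive query over the postings lists, without touching the documents one by one.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def select_query_terms(index, positive_ids, n_docs, max_terms=10):
    """Pick up to max_terms terms that are more common inside the class.

    Illustrative score (an assumption, not the paper's method):
    fraction of class documents containing the term minus the fraction
    of non-class documents containing it.
    """
    positive_ids = set(positive_ids)
    n_pos = len(positive_ids)
    n_neg = n_docs - n_pos
    scores = {}
    for term, postings in index.items():
        pos = len(postings & positive_ids)
        neg = len(postings) - pos
        score = pos / n_pos - (neg / n_neg if n_neg else 0.0)
        if score > 0:
            scores[term] = score
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_terms]


def run_query(index, terms, min_match=1):
    """Return ids of documents matching at least min_match query terms.

    Only postings lists are read, mimicking index-only access.
    """
    hits = defaultdict(int)
    for term in terms:
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return {doc_id for doc_id, count in hits.items() if count >= min_match}


# Toy labelled corpus (hypothetical data).
docs = {
    1: "football match goal league",
    2: "league goal striker",
    3: "stock market shares",
    4: "market bank interest rates",
    5: "goal keeper football",
}
sports_ids = {1, 2, 5}

index = build_inverted_index(docs)
query_terms = select_query_terms(index, sports_ids, len(docs), max_terms=2)
predicted = run_query(index, query_terms)
print(query_terms, sorted(predicted))
```

Raising `min_match` trades recall for precision, which is one way a weighted or thresholded query operator of a real engine could be approximated in this toy setting.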
Large-scale Parallel Web Search Engines (WSEs) need to adopt a strategy for partitioning the invert...
Search engines are exceptionally important tools for accessing information in today’s world. In sati...
Web search engines have to deal with a rapidly increasing amount of information, high query loads an...
Modern search engines face enormous performance challenges. The most popular ones process tens of th...
Previous research into the efficiency of text retrieval systems has dealt primarily with methods tha...
Web Search Engines (WSEs) are probably the most complex information systems in use today, since they need...
We propose a methodology for building a robust query classification system that can identify thousa...
Text search engines return a set of k documents ranked by similarity to a query. Typically, document...
In this paper, we describe a classifier based retrieval scheme for efficiently and accurately retrie...
In this paper, we introduce a new collection selection strategy to be operated in search engines wit...
Text mining and data mining support different kinds of algorithms for classification of large d...
Taxonomies of the Web typically have hundreds of thousands of categories and skewed category distrib...
With the advent of the Web, text information is being generated across the globe at an unfathomable rate...