A large part of the data on the World Wide Web resides in the deep web. Executing structured, high-level queries on deep web data sources involves a number of challenges, several of which arise because query execution engines have very limited access to the data. In this paper, we consider the problem of executing aggregation queries involving data enumeration on these data sources, which requires sampling. The existing work in this area (HDSampler and its variants) is based on simple random sampling. We observe that this approach cannot obtain good estimates when the data is skewed. While there has been a lot of work on sampling skewed data, the existing methods are based on prior knowledge of the data, and are therefore not applicable to hidden...
Top-$k$ query processing is a fundamental building block for efficient ranking in a large number of ...
In decision support applications, the ability to provide fast approximate answers to aggregation que...
With an increasing amount of data in deep web sources (hidden from general search engines behind web fo...
Peer-to-peer databases are becoming prevalent on the Internet for distribution and sharing of docume...
Many databases on the web are “hidden” behind (i.e., accessible only through) their restrictive, fo...
Big data is now being utilized widely and developed rapidly. Research in the big data area is mean...
Recently, there has been growing interest in random sampling from online hidden databases. These dat...
We consider the problem of efficiently sampling Web search engine query results. In turn, using a sm...
Aggregate query processing over very large datasets can be slow and prone to error due to dirty (mis...
Top-k query processing is a fundamental building block for efficient ranking in a large number of ap...
The Information Era has witnessed a huge number of sources from websites. The abundance of...
Data delivered over the Internet is increasingly being used to provide dynamic and personalized user...
An estimation algorithm for a query is a probabilistic algorithm that computes an approximat...