Consider a distributed system with n nodes, where each node holds a multiset of items. In this paper, we design sampling algorithms that allow us to estimate the global frequency of any item with a standard deviation of εN, where N denotes the total cardinality of all these multisets. Our algorithms have a communication cost of O(n + √n/ε), which is never worse than the O(n + 1/ε²) cost of uniform sampling, and could be much better when n ≪ 1/ε². In addition, we prove that one version of our algorithm is instance-optimal in a fairly general sampling framework. We also design algorithms that achieve optimality on the bit level, by combining Bloom filters of various granularities. Finally, we present some simulation results comparing our algor...
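The cost comparison in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not the paper's algorithm: `cost_uniform` and `cost_sqrt` are hypothetical helpers that evaluate the two bounds with all constants dropped, and `uniform_sample_estimate` is an assumed form of the uniform-sampling baseline in which every item is forwarded independently with probability p = 1/(ε²N). Under that assumption, the scaled count of an item with frequency f has variance f(1 − p)/p ≤ ε²N², so its standard deviation is at most εN, and the expected number of forwarded items is 1/ε².

```python
import math
import random
from collections import Counter


def cost_uniform(n, eps):
    """O(n + 1/eps^2) bound with constants dropped (illustrative only)."""
    return n + 1.0 / eps ** 2


def cost_sqrt(n, eps):
    """O(n + sqrt(n)/eps) bound with constants dropped (illustrative only)."""
    return n + math.sqrt(n) / eps


def uniform_sample_estimate(nodes, eps, seed=0):
    """Assumed uniform-sampling baseline (not the paper's method).

    nodes: list of lists, one multiset of items per node.
    Returns (estimated frequencies, number of messages sent).
    """
    rng = random.Random(seed)
    total = sum(len(items) for items in nodes)  # N, gathered with one report per node
    messages = len(nodes)
    if total == 0:
        return {}, messages
    # Sampling probability chosen so the estimator's std is at most eps * N.
    p = min(1.0, 1.0 / (eps ** 2 * total))
    sampled = Counter()
    for items in nodes:
        for item in items:
            if rng.random() < p:
                sampled[item] += 1
                messages += 1  # each sampled item is sent to the coordinator
    # Scale sampled counts by 1/p to get unbiased frequency estimates.
    estimates = {item: count / p for item, count in sampled.items()}
    return estimates, messages
```

For n = 100 and ε = 0.01, the two expressions evaluate to 1,100 and 10,100 respectively. When n exceeds 1/ε², the term √n/ε is smaller than n, so both bounds are dominated by n and differ by at most a constant factor, which is the sense in which the O(n + √n/ε) cost is never worse.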
In this dissertation, we make progress on certain algorithmic problems broadly over two computationa...
We consider the problem of maintaining frequency counts for items occurring frequently in the union ...
A fundamental problem in data management is to draw and maintain a sample of a large data set, for a...
Uniform sampling in networks is at the core of a wide variety of randomized algorithms. Random sampl...
A wide range of mining and analysis problems involve extracting knowledge from count data. Such data...
In this paper we show the power of sampling techniques in designing efficient distributed algorithms...
Recent work in sensor databases has focused extensively on distributed query problems, notably distr...
We give an improved algorithm for drawing a random sample from a large data stream when the input el...
We show that randomization can lead to significant improvements for a few fundamental problems in di...
Tracking frequent items (also called heavy hitters) is one of the most fundamental queries in real-t...
Flow-level traffic measurement is important for network traffic accounting, traffic engine...
Significant progress in the evolution of computer systems and their interconnection over the p...