Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset I ⊆ U, it should be possible to quickly compute I ∩ S, i.e., the elements in I satisfying the sampling condition. Consistent sampling has important applications in similarity estimation and in estimating the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but...
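As a concrete illustration of the basic technique described above, the following is a minimal Python sketch of hash-threshold consistent sampling: an element belongs to S exactly when a fixed salted hash maps it below a threshold, so I ∩ S can be computed simply by hashing the elements of I. The helper names (`_unit_hash`, `consistent_sample`), the salt, and the sampling rate are illustrative assumptions, not part of the paper.

```python
import hashlib

SALT = b"fixed-seed"   # fixing the salt fixes the sample S across all calls (illustrative)
RATE = 0.1             # sampling probability p (illustrative)

def _unit_hash(x: bytes) -> float:
    """Map an element to a pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(SALT + x).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def consistent_sample(I):
    """Return I ∩ S: the elements of I whose hash falls below RATE."""
    return [x for x in I if _unit_hash(str(x).encode()) < RATE]

# Consistency: whether an element is sampled does not depend on which subset it is queried from.
small = consistent_sample([1, 2, 3, 42])
large = consistent_sample(range(100))
assert all((x in large) == (x in small) for x in [1, 2, 3, 42])
```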
Sequential sampling algorithms have recently attracted interest as a way to design scalable algorith...
We investigate the problem of counting the number of frequent (item)sets, a problem known to be intra...
Most of the complexity of common data mining tasks is due to the unknown amount of information conta...
We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at m...
The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamenta...
Analyzing huge datasets becomes prohibitively slow when the dataset does not fit in main memory. App...
While there has been a lot of work on finding frequent itemsets in transaction data streams, none of...
Sampling a dataset for faster analysis and looking at it as a sample from an unknown distr...
We present an algorithm to extract a high-quality approximation of the (top-k) Frequent itemsets (F...
Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-be...
Adaptive sampling [a1] is a probabilistic algorithm invented by M. Wegman (unpublished) around 1980....
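Since distinct-item estimation comes up in several of the abstracts above, here is a minimal sketch of the adaptive sampling idea: keep only elements whose hash falls below 2^(-depth), and increase depth whenever the sample overflows, scaling the sample size back up at the end. The function and parameter names (`adaptive_distinct_estimate`, `m`) and the salted hash are illustrative assumptions in the style of the sketch above, not the referenced algorithm verbatim.

```python
import hashlib

def _unit_hash(x: bytes, salt: bytes = b"fixed-seed") -> float:
    """Salted hash mapping an element to [0, 1), as in the sketch above."""
    return int.from_bytes(hashlib.sha256(salt + x).digest()[:8], "big") / 2**64

def adaptive_distinct_estimate(stream, m=64):
    """Estimate the number of distinct items in `stream` with a sample of at most m items."""
    depth = 0            # keep x iff _unit_hash(x) < 2**-depth
    sample = set()
    for x in stream:
        if _unit_hash(str(x).encode()) < 2.0 ** -depth:
            sample.add(x)
            while len(sample) > m:   # sample overflowed: halve the rate and refilter
                depth += 1
                sample = {y for y in sample
                          if _unit_hash(str(y).encode()) < 2.0 ** -depth}
    return len(sample) * 2 ** depth  # scale the retained sample back up

# e.g. adaptive_distinct_estimate(x % 500 for x in range(10_000)) is roughly 500 in expectation
```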