Abstract: We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular structure for storing such statistics, when they are derived from tabular data sets with many attributes, is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpus data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naïve approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
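For orientation, the traditional prefix tree that the abstract uses as a baseline can be sketched as follows. This is a minimal illustration only, not the adapted ADtree proposed in the paper; the names `TrieNode`, `build_ngram_trie`, and `ngram_count`, as well as the toy token sequence, are assumptions introduced here for the example.

```python
from collections import defaultdict


class TrieNode:
    """One prefix-tree node; `count` is how often the path leading to it occurs."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(TrieNode)


def build_ngram_trie(tokens, max_n):
    """Insert every n-gram of length 1..max_n found in `tokens` (hypothetical helper)."""
    root = TrieNode()
    for start in range(len(tokens)):
        node = root
        # Walk down one level per token, so shared prefixes share nodes.
        for token in tokens[start:start + max_n]:
            node = node.children[token]
            node.count += 1
    return root


def ngram_count(root, ngram):
    """Return the stored count of `ngram`, or 0 if it never occurred."""
    node = root
    for token in ngram:
        if token not in node.children:  # membership test does not create a node
            return 0
        node = node.children[token]
    return node.count


# Toy corpus, assumed purely for illustration.
tokens = "the cat sat on the mat the cat ran".split()
trie = build_ngram_trie(tokens, max_n=3)
print(ngram_count(trie, ["the", "cat"]))  # prints 2
```

Each distinct n-gram costs at most one node here, whereas a naïve table would reserve a cell for every possible token combination; the savings the abstract reports for the adapted ADtree are measured against both of these baselines.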
ADtrees, a data structure useful for caching sufficient statistics, have been successfully adapted t...
In highly repetitive strings, like collections of genomes from the same species, distinct measures o...
The identification of repeated n-gram phrases in text has many practical applications, including aut...
Abstract: Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text do...
Increasingly, data-mining algorithms must deal with databases that continuously grow over time. The...
This paper introduces new algorithms and data structures for quick counting for machine learning dat...
The problem of discovering association rules in large databases has received considerable research ...
This paper deals with the two fundamental problems concerning the handling of large n-gram language ...
There is a wide diversity of applications relying on the identification of the sequences of n consec...
Discovering frequent structures within large natural language corpora is one of the core problems of...
In computational linguistics, large tree databases tagged with morpho-syntactic information are in n...
A significant problem in computer science is the management of large data strings and a great number...