The identification of repeated n-gram phrases in text has many practical applications, including authorship attribution, text reuse identification, and plagiarism detection. We consider methods for finding the repeated n-grams in text corpora, with emphasis on techniques that can be effectively scaled across a cluster of processors to handle very large amounts of text. We compare our proposed method to existing techniques using the 1.5 TB TREC ClueWeb-B text collection, using both single-processor and multiprocessor approaches. The experiments show that our method offers an important tradeoff between speed and temporary storage space, and provides an alternative to previous approaches that scales almost linearly in the length of the sequenc...
Indexing highly repetitive texts - such as genomic databases, software repositories and versioned te...
The automatic detection of shared content in written documents – which includes text reuse and its ...
We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative docume...
This paper considers the issue of frequency consolidation in lists of different length word n-grams ...
This paper deals with the two fundamental problems concerning the handling of large n-gram language ...