A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words (in the order they appear in the text) which appear only once in the book. These words are referred to as ``unique words'' and they constitute a small percentage of all the words in a typical book. Along with the order information, the set of unique words provides a compact representation which is highly descriptive of the content and the flow of ideas in the book. By aligning the sequence of unique words from two books using the longest common subsequence (LCS), one can discover whether two books are duplicates. Experiments ...
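The representation and alignment described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the lowercase alphabetic tokenizer, and the normalization of the LCS length by the shorter unique-word sequence are all assumptions made here for illustration, and the abstract does not state how the alignment is scored or thresholded.

```python
import re
from collections import Counter


def unique_word_sequence(text):
    """Return the words that occur exactly once in `text`, in reading order.

    Tokenization (lowercased runs of a-z) is an assumption of this sketch.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences `a` and `b`.

    Standard dynamic programme, O(len(a) * len(b)) time, O(len(b)) memory.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def duplicate_score(text_a, text_b):
    """Hypothetical similarity score in [0, 1].

    LCS length of the two unique-word sequences divided by the length of
    the shorter sequence; this normalization is an assumption.
    """
    ua = unique_word_sequence(text_a)
    ub = unique_word_sequence(text_b)
    shorter = min(len(ua), len(ub))
    if shorter == 0:
        return 0.0
    return lcs_length(ua, ub) / shorter
```

Because an OCR error usually turns a word into a different token, it tends to drop one element from the common subsequence rather than break the alignment, so a pair of scans of the same book would still be expected to score close to 1 under this sketch. Any decision threshold would have to be chosen empirically.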
Abstract. Near-duplicates are abundant in short text databases. Detecting and eliminating them is of...
This paper introduces a framework for clarifying and formalizing the duplicate document detection pr...
Although much work has been done on duplicate document detection (DDD) and its applications, we obse...
Millions of books from public libraries and private collections have been scanned by various organiz...
As the amount of books available online grows, the sizes of each of these collections are growing at the same pace...
This paper describes an approach for identifying translations of books in large scanned book collect...
This thesis deals with the problem of detecting documents that are so similar to one another...
With the rapid development of the World Wide Web, there are a huge number of fully or fragmentally d...
The paper describes a fault-tolerant method of selecting duplicate bibliographic records in catalogu...
Purpose - The purpose of this paper is to focus on duplicate record detect...
The ever-growing amounts of textual information coming from different sources have fostered the deve...
Digital preservation workflows for image collections involving automatic and semi-automatic image ac...