Our study identifies sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. This is accomplished with a MapReduce implementation of minhash to identify clusters of sentences with high Jaccard similarity. We show that these clusters can be categorized into six different types, two of which are particularly interesting: identical sentences quantify the extent to which content in Wikipedia is copied and pasted, and near-duplicate sentences that state contradictory facts point to quality issues in Wikipedia.
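The core idea above, estimating Jaccard similarity between sentences via minhash signatures, can be sketched in a few lines. This is a minimal single-machine illustration, not the paper's MapReduce implementation; the word-shingle size, number of hash functions, and salted-MD5 hash family are illustrative assumptions.

```python
import hashlib
import random

def shingles(sentence, k=3):
    """Word k-shingles (k-word n-grams) of a sentence."""
    words = sentence.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=128, seed=42):
    """MinHash signature: for each of num_hashes salted hash functions,
    keep the minimum hash value over the set's elements."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [
        min(
            int.from_bytes(hashlib.md5(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for salt in salts
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions is an unbiased
    estimate of the Jaccard similarity of the underlying sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Usage: identical sentences yield identical signatures (estimate 1.0);
# near-duplicates yield a high but sub-1.0 estimate.
sig_a = minhash_signature(shingles("the cat sat on the mat in the sun"))
sig_b = minhash_signature(shingles("the cat sat on the mat in the rain"))
similarity = estimated_jaccard(sig_a, sig_b)
```

In the full pipeline, signatures like these would be computed per sentence in the map phase, and sentences sharing signature bands would be grouped into candidate clusters in the reduce phase.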
Abstract. The online encyclopedia Wikipedia gives rise to a multitude of network structures such as ...
The paper addresses the problem of modeling the relationship between phrases in English using a simi...
Abstract. Parallel sentences are a relatively scarce but extremely useful resource for many applicatio...
Duplicate and near-duplicate web pages are the chief concerns for web search engines. In reality, th...
Motivation: Document similarity metrics such as PubMed’s “Find related articles ” feature, which hav...
Part 1: ConferenceInternational audienceNear duplicate documents and their detection are studied to ...
Multiple approaches to grab comparable data from the Web have been developed up to date. Neverthele...
We test the hypothesis that the extent to which one obtains information on a given topic through Wik...
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the Wo...
Giesen J, Kahlmeyer P, Nussbaum F, Zarrieß S. Leveraging the Wikipedia Graph for Evaluating Word Emb...
The article presents experiments on mining Wikipedia for extracting SMT useful sentence pairs in thr...
Abstract. The mathematical concept of document resemblance captures well the informal notion of syn...
Two years ago, we conducted a study on the evolution of web pages over time. In the course of that s...
A group of documents is called near-duplicates if they are almost the same with just a slight differ...