Our study identifies sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. This is accomplished with a MapReduce implementation of minhash to identify clusters of sentences with high Jaccard similarity. We show that these clusters can be categorized into six different types, two of which are particularly interesting: identical sentences quantify the extent to which content in Wikipedia is copied and pasted, and near-duplicate sentences that state contradictory facts point to quality issues in Wikipedia.
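The core idea above, estimating Jaccard similarity between sentences via minhash signatures, can be sketched in a few lines. This is a minimal single-machine illustration, not the paper's MapReduce implementation; the word-shingle size, number of hash functions, and salted-MD5 hash family are illustrative assumptions.

```python
import hashlib
import random

def shingles(sentence, k=3):
    """Word k-shingles (k-word n-grams) of a sentence."""
    words = sentence.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=128, seed=42):
    """MinHash signature: for each of num_hashes salted hash functions,
    keep the minimum hash value over the set's elements."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [
        min(
            int.from_bytes(hashlib.md5(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for salt in salts
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions is an unbiased
    estimate of the Jaccard similarity of the underlying sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Usage: identical sentences yield identical signatures (estimate 1.0);
# near-duplicates yield a high but sub-1.0 estimate.
sig_a = minhash_signature(shingles("the cat sat on the mat in the sun"))
sig_b = minhash_signature(shingles("the cat sat on the mat in the rain"))
similarity = estimated_jaccard(sig_a, sig_b)
```

In the full pipeline, signatures like these would be computed per sentence in the map phase, and sentences sharing signature bands would be grouped into candidate clusters in the reduce phase.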
Abstract. The online encyclopedia Wikipedia gives rise to a multitude of network structures such as ...
The paper addresses the problem of modeling the relationship between phrases in English using a simi...
Abstract. Parallel sentences are a relatively scarce but extremely useful resource for many applicatio...
Duplicate and near-duplicate web pages are the chief concerns for web search engines. In reality, th...
Motivation: Document similarity metrics such as PubMed’s “Find related articles ” feature, which hav...
Part 1: ConferenceInternational audienceNear duplicate documents and their detection are studied to ...
Multiple approaches to grab comparable data from the Web have been developed up to date. Neverthele...
We test the hypothesis that the extent to which one obtains information on a given topic through Wik...
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the Wo...
Giesen J, Kahlmeyer P, Nussbaum F, Zarrieß S. Leveraging the Wikipedia Graph for Evaluating Word Emb...
The article presents experiments on mining Wikipedia for extracting SMT useful sentence pairs in thr...
Abstract. The mathematical concept of document resemblance captures well the informal notion of syn...
Two years ago, we conducted a study on the evolution of web pages over time. In the course of that s...
A group of documents is called near-duplicates if they are almost the same with just a slight differ...