Extraction and integration of partially overlapping web sources

Mirko Bronzi
Valter Crescenzi
Paolo Merialdo
Paolo Papotti

Publication date

January 2012

Abstract

We present an unsupervised approach for harvesting the data exposed by a set of structured and partially overlapping data-intensive web sources. Our proposal comes within a formal framework tackling two problems: the data extraction problem, to generate extraction rules based on the input websites, and the data integration problem, to integrate the extracted data in a unified schema. We introduce an original algorithm, WEIR, to solve the stated problems and formally prove its correctness. WEIR leverages the overlapping data among sources to make better decisions both in the data extraction (by pruning rules that do not lead to redundant information) and in the data integration (by reflecting local properties of a source over the mediat...
