We present an unsupervised approach for harvesting the data exposed by a set of structured and partially overlapping data-intensive web sources. Our proposal comes within a formal framework tackling two problems: the data extraction problem, to generate extraction rules based on the input websites, and the data integration problem, to integrate the extracted data in a unified schema. We introduce an original algorithm, WEIR, to solve the stated problems and formally prove its correctness. WEIR leverages the overlapping data among sources to make better decisions both in the data extraction (by pruning rules that do not lead to redundant information) and in the data integration (by reflecting local properties of a source over the mediat...
We consider the problem of jointly training structured models for extraction from sources whose inst...
This work explores the usage of Linked Data for Web scale Information Extraction and shows encouragi...
Deep Web (often called hidden web or invisible web) is composed of all the web...
The web contains a huge amount of structured information provided by a large number of web sites. Si...
We consider the problem of jointly training structured models for extraction from multiple web sour...
The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-we...
The WWW is considered as a collection of heterogeneous information sources available online. However...
This dissertation in a broad sense focuses on understanding the fundamental aspects of building a la...
This paper describes a process for mashing heterogeneous data sources based on...
The proliferation of data sources both in the private and public domains (e.g., in enterprise enviro...
A basic step in integration is the identification of linkage points, i.e., finding attributes that a...
An important part of today’s Web is Web databases, in which 80% of the databases are structured data...
The advent of the era of big data on the Web has made automatic web information extraction an essent...
Many large web sites contain highly valuable information. Their pages are dynamically gene...
The World Wide Web contains a huge amount of unstructured and semi-structured information, that is e...