Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not yet been fully exploited for unsupervised data extraction: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes of a single object and links to a page dedicated to the details of that object. We present red, an automatic approach and prototype system for extracting data records from sites that follow this publishing pattern. red leverages the inherent redundancy between result records and their corresponding detail pages to provide an effective, yet fully unsupervised and domain-independent, method. It is able to extract from result ...
Categories and Subject Descriptors: H.3.4 [Systems and Software]: Performance evaluation (efficiency...
Wrapper induction faces a dilemma: To reach web scale, it requires automatically generated examples,...
Abstract. Extracting data from web pages using wrappers is a fundamental problem arising in a large ...
Web extraction is the task of turning unstructured HTML into structured data. Previous approaches re...
Online databases respond to a user query with result records encoded in HTML files. Data extraction,...
Abstract — With the help of HTML form-based search interfaces, a large number of databases have beco...
Search results generated by searchable databases are served dynamically and far larger than the stat...
The advent of the era of big data on the Web has made automatic web information extraction an essen...
The thesis treats automatic extraction of semantic data from Web pages. Within this broad problem, i...
The web is the greatest information source in human history, yet finding all offers for flats with g...