Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site’s Web-pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language process...
As the World Wide Web grows at an unprecedented pace, web page genre identification has rec...
The World Wide Web (WWW) is now a famous medium by which people all around the world can spre...
In this paper we address the problem of unsupervised Web data extraction. We show that unsupervised ...
Web pages consist of not only actual content, but also other elements such as branding banners, nav...
Intelligent information processing systems, such as digital libraries or search engines in...
Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting w...
The web is recognized as the largest data source in the world. The nature of such data is characteri...
We consider the problem of content extraction from on-line news webpages. To explore to what extent ...
Apart from the main content blocks, almost all web pages on the Internet contain such blocks as navi...
This work aims to use machine learning techniques for the classification of specific parts of web pa...
In this paper we present a simple, robust, accurate and language-independent solution for extracting...
In this work, we describe a new Web page segmentation method to extract the semantic structure from ...
At present, information systems combining crawling and information extraction (IE) technologies acqu...
Having focused in earlier chapters on the general structure of the Web, in this chapter we will disc...