Recent work has shown that layout and tag-tree structure are effective for segmenting webpages and labeling HTML elements. However, how to segment and label the text content inside HTML elements effectively is still an open problem. Because much of the text on a webpage consists of fragments that are not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are not directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand text content on webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model...
This paper presents experiments using an algorithm of web page topic segmentat...
Extracting web content is to obtain the required data embedded in web pages, usually includ...
Extracting and processing information from Web pages is an important task in many areas like constru...
In this work, we describe a new Web page segmentation method to extract the semantic structure from ...
An important aspect of research for Web information extraction relates to the inference of complex r...
A new web content structure based on visual representation is proposed in this paper. Many web appli...
In this paper we present a simple, robust, accurate and language-independent solution for extracting...
This thesis describes the design and implementation of an algorithm that, using some initial hints f...
In this paper we propose a model of a Web site that describes the logical structure of contained data. Fur...
Identifying which parts of a Web-page contain target content (e.g., the portion of an online news pa...
Several techniques have been recently proposed to automatically generate Web wrappers, i.e., program...
This work aims to provide a page segmentation algorithm which uses both visual and content informati...
Problem statement: Nowadays, many users use web search engines to find and gather informa...
In data-intensive web sites, pages are generated by scripts that embed data from a backend database i...