Traditional information extraction methods rely mainly on visual-feature-assisted techniques, but because they ignore the hierarchical dependencies within the paragraph structure, some important information is missed. This paper proposes an integrated approach for extracting academic information from conference Web pages. First, Web pages are segmented into text blocks by a new hybrid page segmentation algorithm that combines visual features with DOM structure. Then, these text blocks are labeled by a Tree-structured Random Fields model, and the block functions are differentiated using features such as visual features, semantic features, and hierarchical dependencies. Finally, an additional post-processing is intr...
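The segmentation step can be illustrated with a minimal sketch. It covers only the DOM half of the hybrid idea: it splits a page into text blocks at block-level tags and records each block's nesting depth, which could later serve as one hierarchical feature for a labeling model. Visual features would require a rendering engine and are not shown. The tag set `BLOCK_TAGS`, the class `BlockSegmenter`, and the function `segment` are hypothetical names chosen for illustration; this is not the paper's algorithm.

```python
from html.parser import HTMLParser

# Assumption: tags treated as block boundaries (not taken from the paper).
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "table", "tr", "td",
              "h1", "h2", "h3", "h4", "h5", "h6", "br"}


class BlockSegmenter(HTMLParser):
    """Collect (depth, text) blocks, splitting at block-level DOM tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished (depth, text) pairs
        self._buffer = []   # text pieces of the block being built
        self._depth = 0     # nesting depth of open block-level tags
        self._skip = 0      # > 0 while inside <script> or <style>

    def flush(self):
        # Join inline pieces, then normalize whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append((self._depth, text))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()
            if tag != "br":          # <br> splits text but opens no level
                self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS and tag != "br":
            self.flush()
            self._depth = max(0, self._depth - 1)

    def handle_data(self, data):
        if not self._skip:
            self._buffer.append(data)


def segment(html):
    """Return a list of (depth, text) blocks for an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()                   # emit any trailing text
    return parser.blocks
```

Calling `segment` on a conference page's HTML yields one entry per text block, with inline markup such as `<b>` kept inside its enclosing block. Depth is a crude stand-in for the hierarchical dependencies the abstract mentions; a real system would likely need richer structural and visual cues.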
This work aims to provide a page segmentation algorithm which uses both visual and content informati...
Thesis (Ph.D.)--University of Washington, 2021. The World Wide Web contains countless semi-structured ...
In contrast to traditional document retrieval, a web page as a whole is not a good information unit ...
This paper proposes an automatic method for extracting information from academic conferenc...
We address the problem of academic conference homepage understanding for the Semantic Web. This prob...
Apart from the main content blocks, almost all web pages on the Internet contain such blocks as navi...
As web sites are getting more complicated, the construction of web information extraction systems be...
We describe a feature-rich conditional random field model for the extraction of conference and works...
This paper presents experiments using an algorithm of web page topic segmentat...
In this thesis, we address the challenge of information extraction on the Web. We propose a new web ...
There is a large amount of data available on the Web. Data are often represented as text, enriched w...
The World Wide Web is transforming itself into the largest information resource, making the pro...
The content of a web page is the textual and graphical information that is related to the topi...