Extracting and processing information from Web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using this visual information, one can define heuristics for recognizing common page areas such as the header, left and right menu, footer, and center of a page. We show in initial experiments that, using our heuristics, the defined objects are recognized correctly in 73% of cases. Fin...
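The idea of recognizing page areas from screen coordinates can be illustrated with a minimal sketch. It is not the heuristic set used in the paper: the `Box` type, the `classify_area` function, and the `band` and `side` thresholds are hypothetical names and values chosen only for illustration, assuming each HTML object comes with a bounding box in page pixels.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of a rendered HTML object, in page pixels (assumed input)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_area(box: Box, page_width: float, page_height: float,
                  band: float = 0.2, side: float = 0.3) -> str:
    """Assign a box to a page area using position only.

    `band` and `side` are illustrative thresholds (fractions of the page
    height and width), not values taken from the paper.
    """
    bottom = box.y + box.height
    right = box.x + box.width

    # Entirely inside the top band of the page.
    if bottom <= band * page_height:
        return "header"
    # Entirely inside the bottom band of the page.
    if box.y >= (1 - band) * page_height:
        return "footer"
    # Entirely inside the left strip.
    if right <= side * page_width:
        return "left_menu"
    # Entirely inside the right strip.
    if box.x >= (1 - side) * page_width:
        return "right_menu"
    # Everything else is treated as main content.
    return "center"


if __name__ == "__main__":
    page_w, page_h = 1000, 2000
    samples = [
        Box(0, 0, 1000, 150),
        Box(0, 500, 200, 800),
        Box(300, 500, 400, 800),
    ]
    for b in samples:
        print(b, classify_area(b, page_w, page_h))
```

A real classifier would likely combine such positional rules with object size and content, since position alone cannot separate, for example, a narrow advertisement from a menu.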
Wrappers are tools used to extract relevant information from HTML pages. Current approaches use DOM ...
Due to the explosive growth and popularity of the deep web, information extraction from deep web pag...
There is a large amount of data available on the Web. Data are often represented as text, enriched w...
Extracting and processing information from web pages is an important task in many areas like constru...
This work aims to provide a page segmentation algorithm which uses both visual and content informati...
Despite the exponential WWW growth and the success of the Semantic Web, there is limited sup...
A new web content structure based on visual representation is proposed in this paper. Many web appli...
When automatically extracting information from the world wide web, most established meth...
In contrast to traditional document retrieval, a web page as a whole is not a good information unit ...
Several techniques have been recently proposed to automatically generate Web wrappers, i.e., program...
This paper presents experiments using an algorithm of web page topic segmentat...
As web sites are getting more complicated, the construction of web information extraction systems beco...