Abstract. Extracting and processing information from web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a “bag of words” and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation that includes the browser screen coordinates for every HTML object in a page. Using spatial information, one is able to define heuristics for recognizing common page areas such as a header, left and right menu, footer and the center of a page. We show in initial experiments that using our heuristics, defined objects are recognized proper...
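The idea of recognizing page areas from screen coordinates can be illustrated with a minimal sketch. It is not the heuristics proposed in the paper: the `Box` type, the `classify_area` function, and all threshold fractions below are hypothetical, chosen only to show how a rendered object's position could be mapped to a header, footer, left menu, right menu, or center label.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Screen rectangle of one rendered HTML object, in pixels."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


# Hypothetical thresholds (assumptions for illustration, not from the paper).
HEADER_FRACTION = 0.15   # top band of the page height
FOOTER_FRACTION = 0.15   # bottom band of the page height
SIDE_FRACTION = 0.25     # left and right bands of the page width


def classify_area(box: Box, page_width: float, page_height: float) -> str:
    """Label a box by where its center falls on the rendered page."""
    center_x = box.x + box.width / 2
    center_y = box.y + box.height / 2

    # Vertical bands are checked first, so a full-width banner at the top
    # is labeled as header rather than as a menu.
    if center_y <= page_height * HEADER_FRACTION:
        return "header"
    if center_y >= page_height * (1 - FOOTER_FRACTION):
        return "footer"
    if center_x <= page_width * SIDE_FRACTION:
        return "left menu"
    if center_x >= page_width * (1 - SIDE_FRACTION):
        return "right menu"
    return "center"
```

For example, on a hypothetical 1000 x 2000 pixel page, `classify_area(Box(0, 500, 200, 800), 1000, 2000)` labels a narrow column on the left edge as a left menu. A real system would obtain the coordinates from a browser rendering engine and would likely combine position with other cues.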
As web sites are getting more complicated, the construction of web information extraction systems beco...
There is a large amount of data available on the Web. Data are often represented as text, enriched w...
In contrast to traditional document retrieval, a web page as a whole is not a good information unit ...
Tables on web pages contain a huge amount of semantically explicit information, which makes them a ...
This work aims to provide a page segmentation algorithm which uses both visual and content informati...
Abstract. Despite the exponential WWW growth and the success of the Semantic Web, there is limited sup...
Due to the explosive growth and popularity of the deep web, information extraction from deep web pag...
A new web content structure based on visual representation is proposed in this paper. Many web appli...
This paper presents experiments using an algorithm of web page topic segmentat...
Wrappers are tools used to extract relevant information from HTML pages. Current approaches use DOM ...