Blogs are one of the most prominent means of communication on the web. Their content, interconnections and influence constitute a unique socio-technical artefact of our times that needs to be preserved. The BlogForever project has established best practices and developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents the latest developments of the blog crawler, which is a key component of the BlogForever platform. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching using...
The internet is a great source of data that contains valuable information of all kinds. Weblogs ...
In order to perform analysis over weblogs, we must first identify the appropriate unit of a weblog ...
The massive adoption of social media has provided new ways for individuals to express the...
Blogs are a dynamic communication medium which has been widely established on the web. The BlogForev...
This report outlines an inquiry into the area of web data extraction, conducted within the context o...
This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cl...
This chapter introduces information extraction from blog texts. It argues that the classical techniq...
We present a system that tries to automatically collect and monitor Japanese blog collections that i...
Blog archiving and preservation is not a new challenge. Current solutions are commonly based on typi...
Metadata extraction is known to be a problem in general-purpose Web corpora, a...
This paper proposes a fully automated information extraction methodology for weblogs. The methodolog...
User generated content forms an important domain for mining knowledge. In this paper, we address the...