This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. Cleaning arbitrary web pages in order to extract a corpus suitable for linguistic and language technology research and development has attracted significant research interest recently. Several general-purpose approaches for removing boilerplate have been presented in the literature; however, the blogosphere poses additional requirements, such as finer control over the extracted textual segments in order to accurately identify important elements, i.e. individual blog posts, titles, posting dates or comments. BlogBuster tries to provide such additional details along with boilerplate removal, following a rule-based approac...
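To make the idea of rule-based extraction concrete, the following is a minimal sketch and not BlogBuster's actual implementation. It assumes that blog elements can be recognised by CSS class names; the names in `RULES` (`post-title`, `post-date`, `post-body`, `comment`) and the helper `extract` are hypothetical and chosen only for illustration. Text outside any matching element is treated as boilerplate and dropped. The sketch also assumes reasonably well-formed HTML in which every non-void element is closed.

```python
from html.parser import HTMLParser

# Hypothetical rules: CSS class name -> field it is assumed to mark.
RULES = {
    "post-title": "title",
    "post-date": "date",
    "post-body": "body",
    "comment": "comment",
}

# Elements that never have a closing tag, so they are not put on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class RuleBasedExtractor(HTMLParser):
    """Collects text from elements whose class matches a rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.stack = []   # one (field or None, text parts) entry per open element
        self.fields = []  # (field, text) pairs in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((self.rules[c] for c in classes if c in self.rules), None)
        self.stack.append((field, []))

    def handle_data(self, data):
        # Attach text to the innermost open element that matched a rule;
        # text with no matching ancestor is boilerplate and is ignored.
        for field, parts in reversed(self.stack):
            if field is not None:
                parts.append(data)
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        field, parts = self.stack.pop()
        if field is not None:
            text = " ".join("".join(parts).split())  # normalise whitespace
            if text:
                self.fields.append((field, text))


def extract(html, rules=RULES):
    """Return (field, text) pairs for every element matched by a rule."""
    parser = RuleBasedExtractor(rules)
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    page = (
        '<div class="nav">Home | About</div>'
        '<div class="post"><h2 class="post-title">Hello</h2>'
        '<span class="post-date">2012-05-01</span>'
        '<div class="post-body"><p>First <b>post</b>.</p></div></div>'
        '<div class="comment">Nice!</div>'
        '<div class="footer">Copyright</div>'
    )
    for field, text in extract(page):
        print(field, text, sep=": ")
```

In a sketch like this, supporting a new blog platform would amount to supplying a different rule table, which is one plausible reason to prefer rules over a generic boilerplate classifier when individual fields are needed.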
Nowadays, blogs cover a large audience and they have risen from the underground to...
Blogs are one of the most prominent means of communication on the web. Their content, interconnectio...
Metadata extraction is known to be a problem in general-purpose Web corpora, a...
Blogs are a dynamic communication medium which has been widely established on the web. The BlogForev...
Weblogs, or blogs, are becoming more and more interesting for a wide audience. Millions of personal,...
This report outlines an inquiry into the area of web data extraction, conducted within the context o...
We introduce two corpora gathered on the web and related to computer-mediated communication: blog po...
This chapter introduces information extraction from blog texts. It argues that the classical techniq...
In NLP Centre, dividing text into sentences is currently done with a tool which uses rule-based sy...