We introduce two corpora gathered on the web and related to computer-mediated communication: blog posts and blog comments. In order to build such corpora, we addressed the following issues: website discovery and crawling, content extraction constraints, and text quality assessment. The blogs were manually classified according to their license and content type. Our results show that it is possible to find blogs in German under a Creative Commons license, and that text extraction and linguistic annotation can be performed efficiently enough to allow for a comparison with more traditional text types such as newspaper corpora and subtitles. The comparison gives insights into the distributional properties of the processed web texts at the token and type level...
The dataset comprises homepages for several hundred IT-blogs and websites which have been hand-picked w...
Nowadays, blogs cover a large audience and they rose from the underground to...
Short paper talk at RESAW 2015 conference (Aarhus, Denmark). I would like to pr...
The present paper reports the first results of the compilation and annotation of a blog corpus for G...
This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cl...
Metadata extraction is known to be a problem in general-purpose Web corpora, a...
Previously, linguistic text analysis was performed manually. However, today there are new methodologi...
Weblogs, or blogs, are becoming more and more interesting for a wide audience. Millions of personal,...