The dataset comprises text and metadata extracted from several hundred IT blogs and websites, along with a method to replicate the data by updating its contents and downloading them to the user’s local machine. The targets were hand-picked to represent the discourse on blogs and websites dedicated to questions at the intersection of technology and society in Germany and the United States of America. The texts were retrieved by web crawling. The resulting corpus is accessible through a search platform and can also be reproduced with freely accessible descriptors and software.
As the Web ought to be considered as a series of sources rather than as a sour...
To create the corpus, first we download from Reuters website 27,000 random news articles (HTML webp...
The dataset comprises homepages for several hundred IT-blogs and websites which have been hand-picked w...
This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cl...
Following the assumption that the tech blog sphere represents an avant-garde o...
Short paper talk at RESAW 2015 conference (Aarhus, Denmark). I would like to pr...
We introduce two corpora gathered on the web and related to computer-mediated communication: blog po...
Metadata extraction is known to be a problem in general-purpose Web corpora, a...
In recent years, linguists have become increasingly interested in the language of the Internet—both ...
The present paper reports the first results of the compilation and annotation of a blog corpus for G...