This paper describes crawling and corpus processing in a distributed framework. We present new tools that build upon existing tools like Heritrix and Hadoop. Further, we propose a general workflow for harvesting, cleaning, and processing web data from entire top-level domains in order to produce high-quality monolingual corpora using minimal language-specific data. We demonstrate the utility of the infrastructure by producing corpora for two under-resourced languages. Web corpus production for targeted languages and/or domains thus becomes feasible for anyone.
We present a widely applicable methodology to bring machine translation (MT) to under-resourced lang...
This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than ...
This paper describes the Translational English Corpus (TEC) and the software tools developed in orde...
In this paper we describe a flexible and portable infrastructure for setting up large monolingual la...
This paper presents an approach for building large monolingual corpora and, at the same time, extrac...
Efforts to use web data as corpora seek to provide solutions to problems traditional corpora suffer ...
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been...
Paper presented at: EACL '06: Eleventh Conference of the European Chapter of the Association f...
The Web contains vast amounts of linguistic data. One key issue for linguists and language technolog...
The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stage...
This work presents a straightforward method for extending or creating in-domain web corpora by focus...
Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks....
In web corpus construction, crawling is a necessary step, and it is probably the most costl...
We present an extrinsic evaluation of crawlers of parallel corpora from multi-lingual web sites in m...