This paper describes crawling and corpus processing in a distributed framework. We present new tools that build upon existing tools like Heritrix and Hadoop. Further, we propose a general workflow for harvesting, cleaning, and processing web data from entire top-level domains in order to produce high-quality monolingual corpora using minimal language-specific data. We demonstrate the utility of the infrastructure by producing corpora for two under-resourced languages. Web corpus production for targeted languages and/or domains thus becomes feasible for anyone.
We present a widely applicable methodology to bring machine translation (MT) to under-resourced lang...
This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than ...
This paper describes the Translational English Corpus (TEC) and the software tools developed in orde...
In this paper we describe a flexible and portable infrastructure for setting up large monolingual la...
This paper presents an approach for building large monolingual corpora and, at the same time, extrac...
Efforts to use web data as corpora seek to provide solutions to problems traditional corpora suffer ...
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been...
Paper presented at: EACL '06: Eleventh Conference of the European Chapter of the Association f...
The Web contains vast amounts of linguistic data. One key issue for linguists and language technolog...
The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stage...
This work presents a straightforward method for extending or creating in-domain web corpora by focus...
Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks....
In web corpus construction, crawling is a necessary step, and it is probably the most costl...
We present an extrinsic evaluation of crawlers of parallel corpora from multi-lingual web sites in m...