The Croatian-English parallel corpus MaCoCu-hr-en 1.0 was built by crawling the ".hr" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate and near-duplicate paragraphs, as well as documents that are not in one of the target languages. Document and segment alignment as implemented in Bitextor were carr...
The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 20...
The paper presents the methodology and the outcome of the compilation and the processing of the Bulg...
The srenWaC corpus consists of sentence-aligned parallel Serbian-English texts crawled from the .rs ...
The Slovene-English parallel corpus MaCoCu-sl-en 1.0 was built by crawling the ".si" internet top-le...
The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" i...
The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" int...
The Croatian web corpus MaCoCu-hr 1.0 was built by crawling the ".hr" internet top-level domain in 2...
The Maltese-English parallel corpus MaCoCu-mt-en 1.0 was built by crawling the ".mt" internet top-le...
The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" inter...
The Icelandic-English parallel corpus MaCoCu-is-en 1.0 was built by crawling the ".is" internet top-...
The Bulgarian web corpus MaCoCu-bg 1.0 was built by crawling the ".bg" and ".бг" internet top-level ...
The Maltese web corpus MaCoCu-mt 1.0 was built by crawling the ".mt" internet top-level domain in 20...
The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" and ".cy" internet top-level do...
The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in ...
The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-...