The Slovene-English parallel corpus MaCoCu-sl-en 1.0 was built by crawling the ".si" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate, near-duplicate paragraphs, and documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carri...
The srenWaC corpus consists of sentence-aligned parallel Serbian-English texts crawled from the .rs ...
The paper presents the methodology and the outcome of the compilation and the processing of the Bulg...
The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-...
The Croatian-English parallel corpus MaCoCu-hr-en 1.0 was built by crawling the ".hr" internet top-l...
The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" i...
The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" int...
The Maltese-English parallel corpus MaCoCu-mt-en 1.0 was built by crawling the ".mt" internet top-le...
The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" inter...
The Icelandic-English parallel corpus MaCoCu-is-en 1.0 was built by crawling the ".is" internet top-...
The Bulgarian web corpus MaCoCu-bg 1.0 was built by crawling the ".bg" and ".бг" internet top-level ...
The Croatian web corpus MaCoCu-hr 1.0 was built by crawling the ".hr" internet top-level domain in 2...
The Maltese web corpus MaCoCu-mt 1.0 was built by crawling the ".mt" internet top-level domain in 20...
The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" and ".cy" internet top-level do...
The slenWaC corpus version 1.0 consists of parallel Slovene-English texts crawled from the .si top-l...
The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in ...