The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate, near-duplicate paragraphs, and documents that are not in one of the target languages. Document and segment alignment as implemented in Bitextor...
Abstract: A high-quality parallel corpus needs to be manually created to achieve good machine transl...
We present a Swedish-Turkish parallel corpus and the automatic annotation procedure with tools that ...
This paper focuses on the description of the corpus «PEST-INTER» in five languages and the process o...
The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" int...
The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" i...
The Maltese-English parallel corpus MaCoCu-mt-en 1.0 was built by crawling the ".mt" internet top-le...
The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" and ".cy" internet top-level do...
The Croatian-English parallel corpus MaCoCu-hr-en 1.0 was built by crawling the ".hr" internet top-l...
The Slovene-English parallel corpus MaCoCu-sl-en 1.0 was built by crawling the ".si" internet top-le...
The Icelandic-English parallel corpus MaCoCu-is-en 1.0 was built by crawling the ".is" internet top-...
The Maltese web corpus MaCoCu-mt 1.0 was built by crawling the ".mt" internet top-level domain in 20...
The Bulgarian web corpus MaCoCu-bg 1.0 was built by crawling the ".bg" and ".бг" internet top-level ...
The Croatian web corpus MaCoCu-hr 1.0 was built by crawling the ".hr" internet top-level domain in 2...
The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in ...
We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: ...