The Maltese web corpus MaCoCu-mt 1.0 was built by crawling the ".mt" internet top-level domain in 2021, dynamically extending the crawl to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicate paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata, which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corp...
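The cleaning steps described above (dropping very short texts and near-duplicate paragraphs) can be illustrated with a minimal sketch. It is not the MaCoCu pipeline and does not call jusText or Onion; the word n-gram overlap check is a simplified stand-in for Onion-style deduplication, and the function names and thresholds (`min_words`, `max_seen_ratio`, `n`) are assumptions chosen for illustration. Language identification is omitted.

```python
import re


def shingles(text, n=5):
    """Return the set of lowercase word n-grams found in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Texts shorter than n words yield a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def clean_paragraphs(paragraphs, min_words=5, max_seen_ratio=0.5, n=5):
    """Drop very short paragraphs and paragraphs that mostly repeat earlier ones.

    All thresholds are illustrative assumptions, not values used by MaCoCu.
    """
    seen = set()   # n-grams of every paragraph kept so far
    kept = []
    for par in paragraphs:
        # Step 1: discard very short texts.
        if len(par.split()) < min_words:
            continue
        grams = shingles(par, n)
        if not grams:
            continue
        # Step 2: discard paragraphs whose n-grams were mostly seen already.
        seen_ratio = len(grams & seen) / len(grams)
        if seen_ratio > max_seen_ratio:
            continue
        seen |= grams
        kept.append(par)
    return kept


if __name__ == "__main__":
    sample = [
        "Valletta is the capital city of the island of Malta.",
        "Home | Contact",
        "Valletta is the capital city of the island of Malta!",
        "Gozo is the second largest island in the Maltese archipelago.",
    ]
    for paragraph in clean_paragraphs(sample):
        print(paragraph)
```

In this sketch the second sample paragraph is dropped as too short and the third as a near-duplicate of the first, since punctuation is ignored when the n-grams are built.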
The Catalan Textual Corpus is a 1760-million-token web corpus of Catalan built from several sources:...
The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus ...
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been...
The Bulgarian web corpus MaCoCu-bg 1.0 was built by crawling the ".bg" and ".бг" internet top-level ...
The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" and ".cy" internet top-level do...
The Maltese-English parallel corpus MaCoCu-mt-en 1.0 was built by crawling the ".mt" internet top-le...
The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in ...
The Croatian web corpus MaCoCu-hr 1.0 was built by crawling the ".hr" internet top-level domain in 2...
The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" i...
The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" int...
The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" inter...
The Croatian-English parallel corpus MaCoCu-hr-en 1.0 was built by crawling the ".hr" internet top-l...
The Slovene-English parallel corpus MaCoCu-sl-en 1.0 was built by crawling the ".si" internet top-le...
The Icelandic-English parallel corpus MaCoCu-is-en 1.0 was built by crawling the ".is" internet top-...
The Bosnian web corpus bsWaC was built by crawling the .ba top-level domain in 2014. The corpus was ...