The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a new crawl that has much higher coverage of these selected websites than CommonCrawl. Because the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of these metrics. For more details and raw data download, please visit: http://paracrawl.eu/releases.htm
This work presents a straightforward method for extending or creating in-domain web corpora by focus...
Efforts to use web data as corpora seek to provide solutions to problems traditional corpora suffer ...
In recent years, Transformer-based models have led to significant advances in language modellin...
We report on methods to create the largest publicly available parallel corpora by crawling the web, ...
This corpus was originally created for performance testing (server infrastructure CorpusExplorer - s...
Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks....
The Turkish web corpus MaCoCu-tr 1.0 was built by crawling the ".tr" and ".cy" internet top-level do...
The Maltese web corpus MaCoCu-mt 1.0 was built by crawling the ".mt" internet top-level domain in 20...
The Bulgarian web corpus MaCoCu-bg 1.0 was built by crawling the ".bg" and ".бг" internet top-level ...
The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" inter...
The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stage...
The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" i...
The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" int...