The slenWaC corpus version 1.0 consists of parallel Slovene-English texts crawled from the Slovenian .si top-level domain. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing, used for crawling, and Bitextor, used for bitext extraction. The accuracy of the extracted bitext is around 67% at the segment level and around 68% at the word level.
written; domain-specific (newspaper); synchronic; bilingual; parallel; unidirectional; XML; S-alignm...
The Serbian web corpus srWaC was built by crawling the .rs top-level domain in 2014. The corpus was ...
The Croatian-English parallel corpus MaCoCu-hr-en 1.0 was built by crawling the ".hr" internet top-l...
The srenWaC corpus consists of sentence aligned parallel Serbian-English texts crawled from the .rs ...
The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-...
The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-l...
The Slovene-English parallel corpus MaCoCu-sl-en 1.0 was built by crawling the ".si" internet top-le...
The availability of large collections of text (language corpora) is crucial for empirically supporte...
The RSDO4 parallel corpus of English-Slovene and Slovene-English translation pairs was collected as ...
The RSDO4 parallel corpus of English-Slovene and Slovene-English translation pairs was collected as ...
The paper presents the SVEZ-IJS corpus, a large parallel annotated English-Slovene corpus containing...
This paper presents an approach for building large monolingual corpora and, at the same time, extrac...
The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" i...
The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" int...
The field of natural language processing is an important and extensive branch of computer science, but most...