The Machine Translation datasets KAS-MT 1.0 contain automatically sentence-aligned Slovene and English plain-text abstracts from KAS-Abs 2.0 (http://hdl.handle.net/11356/1449) and are meant for studies in machine translation. The sentence alignment approach requires an alignment reliability threshold: candidate pairs scoring below it are omitted. The threshold value represents a trade-off between the quantity and quality of aligned pairs. We estimate that the default threshold value produces a good-quality dataset for most users. We release three datasets (files), each reflecting a different point in this trade-off. The Normal dataset uses the default reliability threshold and contains 496,102 sentence pairs, the St...
The KAS-abs corpus contains 108,254 automatically identified Slovenian and/or English abstracts (30 ...
In recent years, significant improvements have been achieved in statistical machine translation (MT)...
The KAS-mag corpus of Slovene MSc/MA theses consists of almost 16,000 texts (1,360 thousand pages or...
The training process of the translation model in statistical machine translation requires a sentence...
In machine translation, the alignment of corpora has evolved into a mature research area, aimed at p...
The Corpus of Academic Slovene (KAS) contains Slovene BSc/BA, MSc/MA, and PhD theses from 2000 to 2018. W...
All state of the art statistical machine translation systems and many example-based mach...
Sentence alignment represents the basis for computer-assisted translation (CAT), terminology managem...
When parallel or comparable corpora are harvested from the web, there is typically a tradeoff betwee...
Statistical Word Alignments represent lexical word-to-word translations between source and target la...
The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600...
Machine translation has advanced considerably in recent years, primarily due to the availability of ...
In most statistical machine translation (SMT) systems, bilingual segments are extracted via word al...
The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600...
Training a state-of-the-art syntax-based statistical machine translation (MT) system to translate fr...