If two translation systems differ in performance on a test set, can we trust that this indicates a difference in true system quality? To answer this question, we describe bootstrap resampling methods to compute statistical significance of test results, and validate them on the concrete example of the BLEU score. Even for small test sizes of only 300 sentences, our methods may give us assurances that test result differences are real.
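The bootstrap resampling idea can be illustrated with a minimal Python sketch of a paired comparison. It is not the paper's implementation: the function name `paired_bootstrap`, its parameters, and the use of a plain mean over per-sentence scores as the metric are assumptions made for illustration. BLEU is computed over a whole corpus rather than averaged per sentence, so a real BLEU test would pass a corpus-level scorer as `metric` over per-sentence statistics.

```python
import random
from typing import Callable, Sequence


def paired_bootstrap(
    stats_a: Sequence[float],
    stats_b: Sequence[float],
    metric: Callable[[Sequence[float]], float],
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Return the fraction of resampled test sets on which system A scores above system B.

    stats_a[i] and stats_b[i] are hypothetical per-sentence statistics for the
    same test sentence i; `metric` turns a list of them into one corpus score.
    """
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement and reuse the same draw
        # for both systems, so each comparison is on an identical test set.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([stats_a[i] for i in idx])
        score_b = metric([stats_b[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / n_samples


def mean(values: Sequence[float]) -> float:
    """Stand-in metric (assumption): average of per-sentence scores."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Synthetic scores for a 300-sentence test set, for illustration only.
    data_rng = random.Random(1)
    system_b = [data_rng.random() for _ in range(300)]
    system_a = [min(1.0, s + 0.02 * data_rng.random()) for s in system_b]
    print(paired_bootstrap(system_a, system_b, mean))
```

Under this sketch, a returned fraction of 0.95 or higher would usually be read as system A being better at roughly the 95% level on this test set. The seed is fixed only so that repeated runs give the same result.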
Automatic metrics are fundamental for the development and evaluation of machine translation systems....
This paper aims at providing a reliable method for measuring the correlations between different scor...
The effect of translationese has been studied in the field of machine translation (MT), mostly with ...
Randomized methods of significance testing enable estimation of the probability that an increase in...
Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST...
Automatic metrics are widely used in machine translation as a substitute for human assessment. Wit...
The term translationese has been used to describe features of translated text, and in this paper, we...
We argue that the machine translation community is overly reliant on the Bleu machine translation ev...
We investigate the use of Fisher’s exact significance test for pruning the translation table of a h...
Evaluating the output quality of a machine translation system requires test data and quality metrics t...