Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT. Yet their behavior is not fully understood. In this paper, we analyze some flaws in the BLEU/NIST metrics. With a better understanding of these problems, we can better interpret reported BLEU/NIST scores. In addition, this paper presents a novel method for calculating confidence intervals for BLEU/NIST scores using bootstrapping. With this method, we can determine whether two MT systems are significantly different from each other.
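The bootstrapping idea mentioned in the abstract can be illustrated with a percentile bootstrap over test-set sentences: resample the sentences with replacement, recompute the corpus-level score on each resample, and read the interval off the resulting empirical distribution. The sketch below is only an illustration of that idea, not the paper's implementation; the names bootstrap_ci and toy_precision, the resample count, and the percentile interval are assumptions, and toy_precision merely stands in for a real corpus-level BLEU/NIST scorer.

import random

def bootstrap_ci(hyps, refs, corpus_score, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a corpus-level MT metric.

    hyps, refs   -- parallel lists of hypothesis and reference sentences
    corpus_score -- callable scoring a whole (hyps, refs) corpus; a stand-in
                    for an actual BLEU/NIST implementation
    """
    rng = random.Random(seed)
    n = len(hyps)
    scores = []
    for _ in range(n_resamples):
        # Draw a resampled test set of the same size, with replacement,
        # and rescore the whole corpus (BLEU/NIST are corpus-level metrics).
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(corpus_score([hyps[i] for i in idx], [refs[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def toy_precision(hyps, refs):
    # Fraction of hypothesis tokens that also occur in the corresponding
    # reference; a toy stand-in so the sketch runs without an MT toolkit.
    matched = sum(1 for h, r in zip(hyps, refs) for tok in h.split() if tok in r.split())
    total = sum(len(h.split()) for h in hyps)
    return matched / total if total else 0.0

hyps = ["the cat sat on the mat", "a dog barked loudly"]
refs = ["the cat sat on a mat", "the dog barked loudly"]
print(bootstrap_ci(hyps, refs, toy_precision, n_resamples=200))

To compare two systems under this scheme, the same resampled sentence indices can be applied to both systems' outputs so that the difference in scores, rather than each score in isolation, is bootstrapped; an interval on the difference that excludes zero then indicates a significant difference.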
This paper aims at providing a reliable method for measuring the correlations between different scor...
We report the results of an experiment to assess the ability of automated MT evaluation metrics to r...
Evaluation of machine translation (MT) output is a challenging task. In most cases, there is no sing...
We argue that the machine translation community is overly reliant on the Bleu machine translation ev...
Automatic metrics are fundamental for the development and evaluation of machine translation systems....
The gold standard for measuring machine translation quality is the rating of candidate sentences by ...
Statistical Machine Translation became the dominan...
Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU or NIST, are now w...
We describe a large-scale investigation of the correlation between human judgments of machine transl...
Translation systems are generally trained to optimize BLEU, but many alternative metrics are availab...
Evaluating the output quality of a machine translation system requires test data and quality metrics t...
If two translation systems differ in performance on a test set, can we trust that this indic...