Quality estimation evaluation commonly takes the form of measurement of the error that exists between predictions and gold standard labels for a particular test set of translations. Issues can arise during comparison of quality estimation prediction score distributions and gold label distributions, however. In this paper, we provide an analysis of methods of comparison and identify areas of concern with respect to widely used measures, such as the ability to gain by prediction of aggregate statistics specific to gold label distributions or by optimally conservative variance in prediction score distributions. As an alternative, we propose the use of the unit-free Pearson correlation, in addition to providing an appropriate method of sign...
Training and development data for the WMT18 QE task. Test data will be published as a separate item....
This paper presents the results of the WMT09 shared tasks, which included a translation task, a syst...
Automatic metrics are fundamental for the development and evaluation of machine translation systems....
This paper presents the use of consensus among Machine Translation (MT) systems for the WMT14 Qualit...
Most evaluation metrics for machine translation (MT) require reference translations for each sentenc...
Machine Translation Quality Estimation predicts quality scores for translations produced by Machin...
Automatic Machine Translation metrics, such as BLEU, are widely used in empirical evaluation as a su...
Human-targeted metrics provide a compromise between human evaluation of machine translation, where h...
This paper presents the results of the WMT12 shared tasks, which included a translation task, a task...
Test data for the WMT18 QE task. Train data can be downloaded from http://hdl.handle.net/11372/LRT-2...
Research on translation quality annotation and estimation usually makes use of standard language, so...
We investigate the problem of predicting the quality of sentences produced by machine translation s...
Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new ...