Automatic metrics play a crucial role in machine translation. Despite the widespread use of n-gram-based metrics, there has been a recent surge in the development of pre-trained model-based metrics that focus on measuring sentence semantics. However, these neural metrics, while achieving higher correlations with human evaluations, are often considered to be black boxes with potential biases that are difficult to detect. In this study, we systematically analyze and compare various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems. Through Minimum Risk Training (MRT), we find that certain metrics exhibit robustness defects, such as the presence of universal adversaria...
This report presents an automatic evaluation of the general machine translation task of the Seventh ...
Recently novel MT evaluation metrics have been presented which go beyond pure string matching, and w...
Neural metrics have achieved impressive correlation with human judgements in the evaluation of machi...
We present a comparison of automatic metrics against human evaluations of translation quality in sev...
Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new ...
Discriminative training, a.k.a. tuning, is an important part of Statistical Machine Translation. Thi...
Automatic metrics are fundamental for the development and evaluation of machine translation systems....
Automatic evaluation of language generation systems is a well-studied problem in Natural Language Pr...
Translations generated by current statistical systems often have a large variance, in terms of their...
As machine translation (MT) metrics improve their correlation with human judgement every year, it is...
The problem of evaluating machine translation (MT) systems is more challenging than it may first app...
Human-targeted metrics provide a compromise between human evaluation of machine translation, where h...
We present the first ever results showing that tuning a machine translation system against a seman...
Translation systems are generally trained to optimize BLEU, but many alternative metrics are availab...