In this work we present the fundamentals of the IQMT framework for MT evaluation. IQMT offers a common workbench on which existing evaluation metrics can be utilized. We suggest the IQ measure and test it on the Chinese-to-English data from the IWSLT 2004 Evaluation Campaign. We show how the correlation with human assessments at the system level improves substantially for most individual metrics. Moreover, IQMT allows several metrics to be combined robustly, avoiding scaling problems and metric weightings. Several metric combinations were tried, but correlations did not improve significantly further.
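To make the meta-evaluation step in the abstract concrete, the following is a minimal illustrative sketch (not part of IQMT itself) of how a metric's correlation with human assessments at the system level is typically computed: one automatic score and one human score per system, compared with Pearson's r. All system names and score values are hypothetical placeholders.

# Illustrative sketch: system-level meta-evaluation of an automatic MT metric
# by Pearson correlation with human assessments. All names/values are hypothetical.

from statistics import mean
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# One automatic metric score and one human adequacy score per MT system.
metric_scores = {"sys_A": 0.41, "sys_B": 0.37, "sys_C": 0.52}
human_scores  = {"sys_A": 3.1,  "sys_B": 2.8,  "sys_C": 3.6}

systems = sorted(metric_scores)
r = pearson([metric_scores[s] for s in systems],
            [human_scores[s] for s in systems])
print(f"System-level Pearson correlation: {r:.3f}")

A higher r indicates that the metric ranks systems more consistently with human judges; the abstract's claim is that scoring metrics through IQMT improves this correlation for most individual metrics.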
The WMT evaluation campaign (http://www.statmt.org/wmt16) has been run annually since 2006. It is a ...
We propose three new features for MT evaluation: source-sentence constrained n-gram precision, sourc...
The DARPA MT evaluations of the early 1990s, along with subsequent work on the MT Scale, and the Int...
This report presents a description and tutorial on the IQMT Framework for Machine Translation Evalua...
Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new ...
Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, ma...
Machine Translation (MT) systems are more complex to test than they appear to be at first, since man...
Evaluation of machine translation (MT) is a difficult task, both for humans, and using automatic met...
Most evaluation metrics for machine translation (MT) require reference translations for each sentenc...
Evaluation of machine translation (MT) output is a challenging task. In most cases, there is no sing...
This paper presents the results of the WMT17 Metrics Shared Task. We asked participants of this task...
Automatic Machine Translation (MT) evaluation metrics have traditionally been evaluated by the corre...
In this paper, we describe our submission to the WMT 2021 Metrics Shared Task. We use the automatica...
The DARPA MT evaluations of the early 1990s, along with subsequent work on the MT Scale, and the Int...
This paper aims at providing a reliable method for measuring the correlations between different scor...