Evaluating the output quality of a machine translation system requires test data and quality metrics. Based on the results of the French MT evaluation campaign CESTA, this paper studies the statistical reliability of scores as a function of the amount of test data used to obtain them. Bootstrapping is used to compute the standard deviation of scores assigned by human judges (mainly adequacy) as well as of five automatic metrics. The reliability of the scores is measured using two formal criteria, and the minimal number of documents or segments needed to reach reliable scores is estimated. This number does not depend on the exact subset of documents that is used.
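The bootstrap procedure the abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, the number of resamples, and the per-document adequacy values are all hypothetical assumptions.

```python
import random
import statistics

def bootstrap_std(scores, n_resamples=1000, seed=0):
    """Estimate the standard deviation of the mean score by
    resampling documents with replacement (the bootstrap)."""
    rng = random.Random(seed)
    n = len(scores)
    resampled_means = []
    for _ in range(n_resamples):
        # Draw n documents with replacement and record the mean score.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    # The spread of the resampled means estimates the score's reliability.
    return statistics.stdev(resampled_means)

# Hypothetical per-document adequacy judgments (illustrative only).
adequacy = [3.2, 4.1, 2.8, 3.9, 3.5, 4.0, 2.9, 3.3, 3.7, 3.1]
print(round(bootstrap_std(adequacy), 3))
```

The same resampling can be run at the document or the segment level, which is how the paper varies the amount of test data when estimating how many units are needed for a reliable score.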
This article outlines the evaluation protocol and provides the main results of the French Evaluation...
Evaluation of machine translation (MT) is a difficult task, both for humans, and using automatic met...
This paper aims at providing a reliable method for measuring the correlations between different scor...
Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST...
Any scientific endeavour must be evaluated in order to assess its correctness. In many applied scien...
MT systems are traditionally evaluated with different criteria, such as adequacy and fluency. Automa...
Evaluation of machine translation (MT) output is a challenging task. In most cases, there is no sing...
Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new ...
Most evaluation metrics for machine translation (MT) require reference translations for each sentenc...
Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, ma...