Reproducibility is of utmost concern in machine learning and natural language processing (NLP). In the field of natural language generation (especially machine translation), the seminal paper of Post (2018) pointed out reproducibility problems with the then-dominant metric, BLEU. Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this paper, we ask whether results and claims from four recent BERT-based metrics can be reproduced. We find that reproduction of claims and results often fails because of (i) heavy undocumented preprocessing involved in the metrics, (ii) missing code and (iii) reporting weaker results for the baseline metrics. (iv) In one case, the problem stems from correlating no...
This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA)...
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitab...
Fine-tuning pre-trained models has achieved impressive performance on standard natural language pro...
Automatic evaluation of language generation systems is a well-studied problem in Natural Language Pr...
BERTScore (Zhang et al., 2020), a recently proposed automatic metric for machine translation quality...
Evaluating generated text received new attention with the introduction of model-based metrics in rec...
Against a background of growing interest in reproducibility in NLP and ML, and as part of an ongoing...
In this position statement, we wish to contribute to the discussion about how to assess quality and ...
Automatic metrics play a crucial role in machine translation. Despite the widespread use of n-gram-b...
The evaluation of recent embedding-based evaluation metrics for text generation is primarily based o...
Against the background of what has been termed a reproducibility crisis in science, the NLP field is...
We explore efficient evaluation metrics for Natural Language Generation (NLG). To implement efficien...
This paper reports results from a reproduction study in which we repeated the human evaluation of th...
Reproducibility has become an increasingly debated topic in NLP and ML over recent years, but so far...