In this paper, we present the results of two reproduction studies for the human evaluation originally reported by Dušek and Kasner (2020) in which the authors comparatively evaluated outputs produced by a semantic error detection system for data-to-text generation against reference outputs. In the first reproduction, the original evaluators repeat the evaluation, in a test of the repeatability of the original evaluation. In the second study, two new evaluators carry out the evaluation task, in a test of the reproducibility of the original evaluation under otherwise identical conditions. We describe our approach to reproduction, and present and analyse results, finding different degrees of reproducibility depending on result type, data...
While automatically computing numerical scores remains the dominant paradigm in NLP system evaluatio...
Earlier research has shown that few studies in Natural Language Generation (NLG) evaluate their syst...
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches a...
In this paper we report our reproduction study of the Croatian part of an annotation-based human eva...
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitab...
This paper reports results from a reproduction study in which we repeated the human evaluation of th...
Against a background of growing interest in reproducibility in NLP and ML, and as part of an ongoing...
The Natural Language Generation (NLG) community relies on shared evaluation techniques to understand...
Many evaluation issues for grammatical error detection have previously been overlooked, mak...
With the fast-growing popularity of current large pre-trained language models (LLMs), it is necessar...
In this paper we describe our reproduction study of the human evaluation of text simplicity report...