BERTScore (Zhang et al., 2020), a recently proposed automatic metric for machine translation quality, uses BERT (Devlin et al., 2019), a large pre-trained language model, to evaluate candidate translations with respect to a gold translation. Taking advantage of BERT’s semantic and syntactic abilities, BERTScore seeks to avoid the flaws of earlier approaches like BLEU, instead scoring candidate translations based on their semantic similarity to the gold sentence. However, BERT is not infallible: while it set a new state of the art on NLP tasks in general, studies of specific syntactic and semantic phenomena have shown where BERT’s performance deviates from that of humans. This naturally raises the questions we addr...
We present a comparison of automatic metrics against human evaluations of translation quality in sev...
Recent machine translation shared tasks have shown top-performing systems to tie or in some cases ev...
We present the first ever results showing that tuning a machine translation system against a seman...
Since the advent of automatic evaluation, tasks within Natural Language Processing (NLP), including ...
In this position statement, we wish to contribute to the discussion about how to assess quality and ...
Reproducibility is of utmost concern in machine learning and natural language processing (NLP). In t...
Automatic metrics are fundamental for the development and evaluation of machine translation systems....
Assessing the quality of candidate translations involves diverse linguistic facets. However, most au...
Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new ...
Human ranking of machine translation output is a commonly used method for comparing different innov...
Automatic evaluation metrics are fast and cost-effective measurements of the quality of a Machine Tr...
The quality of machine translation has increased remarkably over the past years, to the degree that ...
Automatic metrics play a crucial role in machine translation. Despite the widespread use of n-gram-b...
The strict character of most of the existing Machine Translation (MT) evaluation metrics does not pe...
This paper aims to automatically identify which linguistic phenomena represent barriers to better MT...
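To make the similarity-based scoring mentioned in the BERTScore abstract concrete, the following is a minimal sketch of greedy cosine matching between candidate and reference token vectors. It assumes each token has already been mapped to a vector; the toy three-dimensional vectors and the function names `cosine` and `greedy_match_scores` are illustrative assumptions, not the BERTScore implementation, and the sketch omits BERT itself as well as refinements such as importance weighting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy token matching.

    Precision averages, over candidate tokens, the best similarity to any
    reference token; recall does the same from the reference side.
    """
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy vectors standing in for contextual embeddings (assumed, not real BERT output).
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
exact_candidate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
partial_candidate = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

exact_scores = greedy_match_scores(exact_candidate, reference)
partial_scores = greedy_match_scores(partial_candidate, reference)
```

In this toy setting, a candidate whose vectors coincide with the reference scores 1.0 on all three measures, while a candidate that matches only one of two reference tokens scores 0.5. With real contextual embeddings the similarities are graded rather than all-or-nothing, which is what lets such a metric reward paraphrases that n-gram overlap would miss.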