Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to depend on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation models and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accept...
We consider the evaluation problem in Natural Language Generation (NLG) and present results for eval...
We introduce GEM, a living benchmark for natural language Generation (NLG), it...
Automatic evaluation of language generation systems is a well-studied problem in Natural Language Pr...
Driven by deep learning breakthroughs, natural language generation (NLG) models have been at the cen...
Evaluating generated text received new attention with the introduction of model-based metrics in rec...
Automatic evaluation metrics capable of replacing human judgments are critical to allowing fast deve...
We explore efficient evaluation metrics for Natural Language Generation (NLG). To implement efficien...
A major challenge in the field of Text Generation is evaluation, because we lack a sound theory that...
A major challenge in the field of Text Generation is evaluation: Human evaluations are cost-intensiv...
Large language models (LLMs) have demonstrated significant capability to generalize across a large n...