Large language models (LLMs) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs align better with human judgments than traditional reference-based evaluators, using them still poses many challenges. Reference-free evaluators are better suited to open-ended examples, which admit responses with different semantics. But not all examples are open-ended: for closed-ended examples with a unique semantically correct response, a reference-free evaluator may still rate a response highly even when it is inconsistent with the facts and the semantics of the reference. In order to comprehensively evaluate the reliability of evaluators based on...
With ChatGPT-like large language models (LLM) prevailing in the community, how to evaluate the abili...
There are many ways to express similar things in text, which makes evaluating natural language gener...
Human ratings are the gold standard in NLG evaluation. The standard protocol is to collect ratings o...
In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language mo...
As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their...
The recent success of large language models (LLMs) has shown great potential to develop more powerfu...
Despite tremendous advancements in dialogue systems, stable evaluation still requires human judgment...
Despite significant research effort in the development of automatic dialogue evaluation metrics, lit...
Semantic consistency of a language model is broadly defined as the model's ability to produce semant...
We present an empirical evaluation of various outputs generated by nine of the most widely-available...
The development of large language models (LLMs) such as ChatGPT has brought a lot of attention recen...
Grounded text generation systems often generate text that contains factual inconsistencies, hinderin...
The recent popularity of large language models (LLMs) has brought a significant impact to boundless ...
The Natural Language Generation (NLG) community relies on shared evaluation techniques to understand...
Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing ...