Recent advancements in reference-free learned metrics for open-domain dialogue evaluation have been driven by progress in pre-trained language models and the availability of dialogue data with high-quality human annotations. However, current studies predominantly concentrate on English dialogues, and the generalization of these metrics to other languages has not been fully examined. This is largely due to the absence of a multilingual dialogue evaluation benchmark. To address this issue, we introduce xDial-Eval, built on top of open-source English dialogue evaluation datasets. xDial-Eval includes 12 turn-level and 6 dialogue-level English datasets, comprising 14930 annotated turns and 8691 annotated dialogues respectively. The English di...
During the past decade, several areas of speech and language understanding have witnessed substantia...
Practical dialog systems need to deal with various knowledge sources, noisy user expressions, and th...
Achieving robust language technologies that can perform well across the world's many languages is a ...
The advent and fast development of neural networks have revolutionized the research on dialogue syst...
The main limiting factor in the development of robust multilingual dialogue evaluation metrics is th...
Despite significant research effort in the development of automatic dialogue evaluation metrics, lit...
Despite tremendous advancements in dialogue systems, stable evaluation still requires human judgment...
Evaluation is a critical element in the development process of many natural language based systems. ...
We present a systematic study and comprehensive evaluation of large language models for automatic mu...
Chatbots are designed to carry out human-like conversations across different domains, such as genera...
Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing...
Natural Language Generation (NLG) typically involves evaluating the generated text in various aspect...
LLMs (large language models) such as ChatGPT have shown remarkable language understanding and genera...
The brittleness of finetuned language model performance on out-of-distribution (OOD) test samples in...
Despite recent progress in open-domain dialogue evaluation, how to develop automatic metrics remains...