Realizing general-purpose language intelligence has been a longstanding goal for natural language processing, where standard evaluation benchmarks play a fundamental and guiding role. We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic. To this end, we propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features: (1) Hierarchical benchmark framework, where datasets are principally selected and organized with a language capability-task-dataset hierarchy. (2) Multi-level scoring strategy, where different levels of model performance are provided based on the hierarchical framework. To facilitate CUGE, we provide a pub...
In this paper, we introduce an advanced Russian general language understanding evaluation benchmark ...
In this thesis, I show the advantages of using symbolic parsers for Grammatical Error Detection and ...
This dissertation defends in some small measure the thesis that there is a universal parsing model f...
With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's cap...
Artificial Intelligence (AI), along with the recent progress in biomedical language understanding, i...
Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks. I...
Chinese Grammatical Error Correction (CGEC) aims to automatically detect and correct grammatical err...
Offensive language detection is increasingly crucial for maintaining a civilized social media platfo...
Purpose: The purpose of this study was to evaluate performance of the Language Environment Analysis ...
Through the development of large-scale natural language models with writing and dialogue capabilitie...
We introduce GEM, a living benchmark for natural language Generation (NLG), it...
Large-scale language models (LLMs) have shown remarkable capability in various Natural Language Pr...
Evaluation in machine learning is usually informed by past choices, for example which datasets or me...
Practical dialog systems need to deal with various knowledge sources, noisy user expressions, and th...
Large-scale pre-training has shown remarkable performance in building open-domain dialogue systems. ...