Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (6× larger than existing benchmarks), spanning 12 evaluation dimensions that cover comprehension of both the image and video modalities. We develop an advanced pipeline for generating multiple-choice questions that target specific...
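To make the evaluation protocol described above concrete, the sketch below shows how a benchmark of multiple-choice questions with ground-truth options can be scored objectively, without a human or GPT judge. The score_choice callback and the question-record fields are illustrative assumptions, not the actual SEED-Bench implementation; a common choice is to rank the candidate options by the model's answer log-likelihood and report accuracy per evaluation dimension.

# Minimal sketch of multiple-choice evaluation under the assumptions stated above.
# score_choice(model, image, question, choice) -> float is a hypothetical hook;
# higher means the model finds that option more likely.
from collections import defaultdict

def evaluate(model, questions, score_choice):
    """questions: iterable of dicts with keys 'image', 'question',
    'choices' (list of option strings), 'answer' (index of the correct option),
    and 'dimension' (the evaluation dimension the question belongs to)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        # Rank the candidate options by the model's score and pick the best one.
        scores = [score_choice(model, q["image"], q["question"], c) for c in q["choices"]]
        prediction = max(range(len(scores)), key=scores.__getitem__)
        total[q["dimension"]] += 1
        if prediction == q["answer"]:
            correct[q["dimension"]] += 1
    # Per-dimension accuracy, e.g. across the 12 dimensions mentioned above.
    return {dim: correct[dim] / total[dim] for dim in total}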
Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing...
With ChatGPT-like large language models (LLMs) prevailing in the community, how to evaluate the abili...
We provide a new multi-task benchmark for evaluating text-to-image models. We perform a human evalua...
The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research ...
Large language models (LLMs) have garnered significant attention, but the definition of "large" lack...
We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 lang...
Recent developments in large language models (LLMs) have shown promise in enhancing the capabilities...
As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their...
Scaling language models with more data, compute and parameters has driven significant progress in na...
Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reason...
As the performance of large language models rapidly improves, benchmarks are getting larger and more...
Recently, large language models (LLMs), including notable models such as GPT-4 and burgeoning commun...
The development of large language models (LLMs) such as ChatGPT has brought a lot of attention recen...
Large language models (LLMs) are a special class of pretrained language models obtained by scaling m...