Video Question Answering (VideoQA) aims to answer natural language questions about given videos. It has attracted increasing attention with recent research trends in joint vision and language understanding. Yet, compared with ImageQA, VideoQA remains largely underexplored and progresses slowly. Although different algorithms have continually been proposed and have shown success on different VideoQA datasets, we find that the field lacks a systematic survey categorizing them, which seriously impedes its advancement. This paper thus provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on the datasets, algorithms, and unique challenges. We then point out the research trend of studying beyond factoid QA to inference QA towards t...
We introduce a new task, named video corpus visual answer localization (VCVAL), which aims to locate...
We propose a novel video understanding task by fusing knowledge-based and video question answering. ...
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both vide...
Video question answering is a challenging task that requires understanding jointly the language inpu...
We propose a scalable approach to learn video-based question answering (QA): to answer a free-form n...
Given visual input and a natural language question about it, the visual question answering (VQA) tas...
Recent methods for visual question answering rely on large-scale annotated da...
Our code, datasets and trained models are available at https://antoyang.github.io/just-ask.html.
Embodied Question Answering (EQA) is a recently proposed task, where an agent is placed in a rich 3D...
Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both th...
To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in visi...
While recent large-scale video-language pre-training made great progress in video question answering...
Recent developments in modeling language and vision have been successfully applied to image question...
Video Question Answering (VideoQA) requires fine-grained understanding of both video and language mo...
Visual Question Answering (VQA) is an extremely stimulating and challenging research area where Comp...