Embodied Question Answering (EQA) is a recently proposed task, where an agent is placed in a rich 3D environment and must act based solely on its egocentric input to answer a given question. The desired outcome is that the agent learns to combine capabilities such as scene understanding, navigation, and language understanding in order to perform complex reasoning in the visual world. However, initial advancements combining standard vision and language methods with imitation and reinforcement learning algorithms have shown that EQA might be too complex and challenging for these techniques. In order to investigate the feasibility of EQA-type tasks, we build the VideoNavQA dataset that contains pairs of questions and videos generated in the House3D ...
Visual Question Answering (VQA) raises a great challenge for computer vision and natural language pr...
We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA). In the 3D-QA task, ...
Visual Question Answering (VQA) has attracted much attention in both computer vision and natural lan...
Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both th...
Visual Question Answering (VQA) is an extremely stimulating and challenging research area where Comp...
Video Question Answering (VideoQA) aims to answer natural language questions according to the given ...
In recent years, visual question answering (VQA) has become topical. The premise of VQA's significan...
Embodied question answering is the task of asking a robot about objects in a 3D environment. The rob...
Given visual input and a natural language question about it, the visual question answering (VQA) tas...
Visual Question Answering (VQA) has witnessed tremendous progress in recent years. However, most eff...
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answeri...
Visual Question Answering (VQA) is the task of answering questions based on an image. The field has ...
Our code, datasets and trained models are available at https://antoyang.github.io/just-ask.html. Inte...
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an i...
We propose a scalable approach to learn video-based question answering (QA): to answer a free-form n...
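The abstracts above share a common classifier-style formulation: encode the visual input (an image, a video, or an embodied agent's egocentric observations), encode the natural language question, fuse the two representations, and score a fixed set of candidate answers. The following is a minimal illustrative sketch of that formulation in PyTorch; the module names, layer sizes, and multiplicative fusion are assumptions chosen for brevity and do not reproduce any of the listed papers' architectures.

import torch
import torch.nn as nn


class TinyVQAModel(nn.Module):
    """Illustrative image + question -> answer-logits model (not any paper's method)."""

    def __init__(self, vocab_size=1000, num_answers=100, hidden=256):
        super().__init__()
        # Image branch: a small convolutional encoder with global pooling.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        # Question branch: word embeddings processed by an LSTM.
        self.embed = nn.Embedding(vocab_size, 128, padding_idx=0)
        self.lstm = nn.LSTM(128, hidden, batch_first=True)
        # Fusion by elementwise product, then a linear classifier over candidate answers.
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, image, question_tokens):
        img_feat = self.image_encoder(image)                 # (B, hidden)
        _, (h_n, _) = self.lstm(self.embed(question_tokens)) # final hidden state summarizes the question
        q_feat = h_n[-1]                                     # (B, hidden)
        fused = img_feat * q_feat                            # simple multiplicative fusion
        return self.classifier(fused)                        # answer logits, shape (B, num_answers)


if __name__ == "__main__":
    model = TinyVQAModel()
    image = torch.randn(2, 3, 224, 224)          # a batch of two RGB images
    question = torch.randint(1, 1000, (2, 12))   # two tokenized 12-word questions
    logits = model(image, question)
    print(logits.shape)  # torch.Size([2, 100]): scores over the candidate answer vocabulary

VideoQA, 3D-QA, and EQA variants described above differ mainly in the visual branch (e.g., a frame- or point-cloud encoder instead of a single-image CNN) and, for embodied settings, in coupling this answering module with a navigation policy; the question encoder and answer classifier play the same role.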