Visual question answering (VQA) is challenging not only because the model has to handle multi-modal information, but also because it is very hard to collect sufficient training examples: there are too many questions one can ask about an image. As a result, a VQA model trained solely on human-annotated examples could easily overfit the specific question styles or image contents being asked about, leaving the model largely ignorant of the sheer diversity of questions. Existing methods address this issue primarily by introducing an auxiliary task such as visual grounding, cycle consistency, or debiasing. In this paper, we take a drastically different approach. We found that many of the "unknowns" to the learned VQA model are indeed "kn...
Visual Question Answering (VQA) has attracted much attention in both computer vision and natural lan...
Using deep learning, computer vision now rivals people at object recognition and detection, opening ...
Visual question answering (VQA) models, in particular modular ones, are commonly trained on large-sc...
Despite the great progress of Visual Question Answering (VQA), current VQA models heavily rely on th...
Given visual input and a natural language question about it, the visual question answering (VQA) tas...
One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredic...
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answer...
Text-VQA aims at answering questions that require understanding the textual cues in an image. Despit...
The open-ended question answering task of Text-VQA often requires reading and reasoning about rarely...
Zero-shot Visual Question Answering (VQA) is a prominent vision-language task that examines both the...
In recent years, visual question answering (VQA) has become topical. The premise of VQA's significan...
Visual Question Answering (VQA) is the task of answering questions based on an image. The field has ...
One of the key limitations of traditional machine learning methods is their requirement for training...
Generalization beyond the training distribution is a core challenge in machine learning. The common ...
Models for Visual Question Answering (VQA) are notorious for their tendency to rely on dataset biase...