The open-ended question answering task of Text-VQA often requires reading and reasoning about rarely seen or completely unseen scene-text content of an image. We address this zero-shot nature of the problem by proposing the generalized use of external knowledge to augment our understanding of the scene text. We design a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision language understanding tasks. Through empirical evidence and qualitative results, we demonstrate how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities. We generate results comparable to the state-o...
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowle...
Accurately answering a question about a given image requires combining observations with general kno...
One of the key limitations of traditional machine learning methods is their requirement for training...
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answer...
The fields of computer vision and natural language processing have made significant advances in visual...
Texts in scene images convey critical information for scene understanding and reasoning. The abiliti...
In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (ST...
Visual Question Answering (VQA) is the task of answering questions based on an image. The field has ...
Text-VQA aims at answering questions that require understanding the textual cues in an image. Despit...
The primary focus of recent work with largescale transformers has been on optimizing the amount of i...
Given visual input and a natural language question about it, the visual question answering (VQA) tas...
Visual question answering (VQA) is challenging not only because the model has to handle multi-modal ...
Using deep learning, computer vision now rivals people at object recognition and detection, opening ...
Visual Question Answering (VQA) has emerged as an important problem spanning Computer Vision, Natura...
Visual question answering (VQA) demands simultaneous comprehension of both the image visual content ...
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowle...
Accurately answering a question about a given image requires combining observations with general kno...
One of the key limitations of traditional machine learning methods is their requirement for training...
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answer...
The fields of computer vision and natural language processing have made significant advances in visual...
Texts in scene images convey critical information for scene understanding and reasoning. The abiliti...
In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (ST...
Visual Question Answering (VQA) is the task of answering questions based on an image. The field has ...
Text-VQA aims at answering questions that require understanding the textual cues in an image. Despit...
The primary focus of recent work with largescale transformers has been on optimizing the amount of i...
Given visual input and a natural language question about it, the visual question answering (VQA) tas...
Visual question answering (VQA) is challenging not only because the model has to handle multi-modal ...
Using deep learning, computer vision now rivals people at object recognition and detection, opening ...
Visual Question Answering (VQA) has emerged as an important problem spanning Computer Vision, Natura...
Visual question answering (VQA) demands simultaneous comprehension of both the image visual content ...
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowle...
Accurately answering a question about a given image requires combining observations with general kno...
One of the key limitations of traditional machine learning methods is their requirement for training...