Text in scene images conveys critical information for scene understanding and reasoning, so the ability to read that text and reason over it is central to text-based visual question answering (TextVQA). However, current TextVQA models do not center on the text and suffer from several limitations: lacking semantic guidance during answer prediction, a model is easily dominated by language biases and optical character recognition (OCR) errors. In this paper, we propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module (ICSP) and a semantics-centered transformer module (SCT). Equipped with the two modules, the semantics-centered model can resist ...
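The abstract names an instance-level contrastive semantic prediction module (ICSP) but does not spell out its objective. As a point of reference only, the sketch below shows a generic InfoNCE-style instance-level contrastive loss of the kind such a module typically builds on; the function name, tensor shapes, and the choice of encoded ground-truth answer text as positives are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def instance_level_contrastive_loss(pred_semantics, target_semantics, temperature=0.07):
    """
    Generic InfoNCE-style instance-level contrastive loss (illustrative sketch).

    pred_semantics:   (B, D) predicted answer-semantic embeddings, one per instance.
    target_semantics: (B, D) reference semantic embeddings (e.g., encoded ground-truth
                      answer text); row i is the positive for prediction i, and the
                      other rows in the batch serve as in-batch negatives.
    """
    pred = F.normalize(pred_semantics, dim=-1)
    target = F.normalize(target_semantics, dim=-1)

    # Cosine-similarity logits between every prediction and every target in the batch.
    logits = pred @ target.t() / temperature          # (B, B)
    labels = torch.arange(pred.size(0), device=pred.device)

    # Pull each prediction toward its matched target, push it away from the others.
    return F.cross_entropy(logits, labels)
```

For example, with a batch of 32 predicted and reference embeddings of dimension 768, each matched pair acts as the positive and the remaining 31 rows as negatives; how SC-Net actually constructs positives and negatives is not specified in the abstract.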