Visual Question Answering (VQA) is a recently proposed multimodal task in machine learning. The input consists of a single image and an associated natural language question, and the output is the answer to that question. In this thesis we propose two incremental modifications to an existing model which won the VQA Challenge in 2016 using multimodal compact bilinear pooling (MCB), a novel way of combining modalities. First, we add a language attention mechanism, and on top of that we introduce an image attention mechanism focusing on objects detected in the image ("region attention"). We also experiment with ways of combining these in a single end-to-end model. The thesis describes the MCB model and our ...
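As a rough illustration of the pooling step named in the abstract above, the following is a minimal NumPy sketch of compact bilinear pooling as it is commonly described: each modality vector is projected with a count sketch, and the two sketches are combined by circular convolution computed through the FFT. It is not the thesis's implementation. The function names (`count_sketch`, `mcb_pool`), the toy dimensions, and the random stand-in feature vectors are all assumptions made for illustration.

```python
import numpy as np


def count_sketch(x, h, s, d):
    """Project x into d dimensions: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    # np.add.at accumulates correctly when several indices in h collide.
    np.add.at(y, h, s * x)
    return y


def mcb_pool(v, q, h_v, s_v, h_q, s_q, d):
    """Combine two vectors by circular convolution of their count sketches."""
    fv = np.fft.rfft(count_sketch(v, h_v, s_v, d))
    fq = np.fft.rfft(count_sketch(q, h_q, s_q, d))
    # An element-wise product in the frequency domain is a circular
    # convolution in the original domain.
    return np.fft.irfft(fv * fq, n=d)


# Toy sizes (assumed); real models use far larger feature vectors.
rng = np.random.default_rng(0)
n_v, n_q, d = 8, 6, 16

# Hash indices and signs are drawn once and then kept fixed.
h_v = rng.integers(0, d, size=n_v)
s_v = rng.choice([-1.0, 1.0], size=n_v)
h_q = rng.integers(0, d, size=n_q)
s_q = rng.choice([-1.0, 1.0], size=n_q)

v = rng.standard_normal(n_v)  # stand-in for an image feature vector
q = rng.standard_normal(n_q)  # stand-in for a question feature vector

pooled = mcb_pool(v, q, h_v, s_v, h_q, s_q, d)
```

The result has length `d` regardless of the input sizes, which is the point of the technique: it approximates the outer product of the two vectors without ever forming the full `n_v * n_q` matrix.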
Visual Question Answering (VQA) is the task of answering questions based on an image. The field has ...
Rich and dense human labeled datasets are among the main enabling factors for the recent advance on ...
Attention is a substantial mechanism for humans to process massive data. It omits the trivial parts a...
© 2017 IEEE. Visual question answering (VQA) is challenging because it requires a simultaneous under...
© 2018 IEEE. Visual question answering (VQA) is challenging, because it requires a simultaneous unde...
This paper describes the contribution by participants from Umeå University, Sweden, in collaboration...
Recently, the Visual Question Answering (VQA) task has gained increasing attention in artificial int...
Visual Question Answering (VQA) requires a simultaneous understanding of images and questions. Exist...
Visual Question Answering (VQA) is a task for evaluating image scene understanding abilities and sho...
Visual Question Answering (VQA) raises a great challenge for computer vision and natural language pr...
Visual question answering (VQA) demands simultaneous comprehension of both the image visual content ...
One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredic...
Visual Question Answering (VQA) is an extremely stimulating and challenging research area where Comp...
In this paper, we provide external image features and use the internal attention mechanism to solve ...
CVPR 2019 accepted paper. Multimodal attentional networks are currently state-of-...