Computer Vision has undergone major changes over the past five years. Here, we investigate whether the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and on the foundations of a Visual Turing Test, in which scene understanding is tested by a series of questions about the scene's content. In our studies, we propose DAQUAR, the first ‘question answering about real-world images’ dataset, together with two methods that address the problem: a symbolic-based and a neural-based visual question answering architecture. The symbolic-based method relies on a semantic parser, a database of visual fact...
Progress in language and image understanding by machines has sparked the interest of the research c...
In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (ST...
This paper proposes to improve visual question answering (VQA) with structured representations of bo...
Computer Vision is a scientific discipline which involves the development of an algorithmic basis fo...
Many vision and language tasks require commonsense reasoning beyond data-driven image and natural la...
Together with the development of more accurate methods in Computer Vision and Natural Language Under...
Visual Question Answering (VQA) is an extremely stimulating and challenging research area where Comp...
Visual Question Answering (VQA) is the task of answering questions based on an image. The field has ...
One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredic...
We propose a method for visual question answering which combines an internal representation of the c...
The computer vision community has long focused on classic tasks such as object detection, huma...
As language and visual understanding by machines progresses rapidly, we are observing an increasing ...
Visual question answering (VQA) demands simultaneous comprehension of both the image visual content ...
There has been immense progress in the fields of computer vision, object detection and natural langu...