We propose a novel method for learning visual concepts and their correspondence to the words of a natural language. The concepts and correspondences are jointly inferred from video clips depicting simple actions involving multiple objects, together with corresponding natural language commands that would elicit these actions. Individual objects are first detected, together with quantitative measurements of their colour, shape, location and motion. Visual concepts emerge from the co-occurrence of regions within a measurement space and words of the language. The method is evaluated on a set of videos generated automatically using computer graphics from a database of initial and goal configurations of objects. Each video is annotated with multi...
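The idea that concepts emerge from the co-occurrence of measurement-space regions and words can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the measurement space has already been discretised into labelled regions (the hypothetical labels `colour:red`, `shape:cube`, and so on), and it scores each word-region pair with a simple Jaccard overlap of the clips in which they appear. The function names and the toy data are likewise assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_scores(examples):
    """Score word-region pairs by Jaccard overlap across clips.

    examples: list of (words, regions) pairs, one per video clip.
    Region labels are a hypothetical discretisation of the measurement space.
    """
    word_counts = Counter()
    region_counts = Counter()
    pair_counts = Counter()
    for words, regions in examples:
        words, regions = set(words), set(regions)
        word_counts.update(words)
        region_counts.update(regions)
        pair_counts.update(product(words, regions))
    # Jaccard: clips containing both / clips containing either.
    return {
        (w, r): c / (word_counts[w] + region_counts[r] - c)
        for (w, r), c in pair_counts.items()
    }


def best_region(scores, word):
    """Return the region most strongly associated with a word."""
    candidates = {r: s for (w, r), s in scores.items() if w == word}
    return max(candidates, key=candidates.get)


# Toy data (invented): each clip pairs a command with detected regions.
examples = [
    (["pick", "red", "cube"], ["colour:red", "shape:cube"]),
    (["pick", "red", "ball"], ["colour:red", "shape:ball"]),
    (["pick", "green", "cube"], ["colour:green", "shape:cube"]),
]
scores = cooccurrence_scores(examples)
print(best_region(scores, "red"))    # colour:red
print(best_region(scores, "green"))  # colour:green
```

On this toy data, a word such as "pick", which appears in every clip, receives no score of 1.0 for any single region, whereas colour and shape words align perfectly with one region each.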
We present a cognitively plausible system capable of acquiring knowledge in language and vision from...
Despite the availability of a huge amount of video data accompanied by descriptive texts, it is not ...
We present a cognitively plausible novel framework capable of learning the grounding in visual seman...
We propose a novel method for learning visual concepts and their correspondence to the words of a na...
Linking natural language to visual data is an important topic at the intersection of Natural Languag...
Generating natural language descriptions for visual data links computer vision and computational lin...
This electronic version was submitted by the student author. The certified thesis is available in th...
People typically learn through exposure to visual facts associated with linguistic descriptions. For...
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Com...
There has been continuous growth in the volume and ubiquity of video material. It has become essenti...
This paper integrates techniques in natural language processing and computer vision to improve recog...
This paper introduces an open computational framework for visual perception and grounded language ac...
Vision to language problems, such as video annotation, or visual question answering, stand out from ...
We present a system that enables embodied agents to learn about different components of the perceive...
Recent work has shown that the integration of visual information into text-based models can substant...