Current multimodal data processing methods use deep learning to combine complementary visual and textual information. Although large neural networks can be useful, they are often computationally intensive and require very large training datasets. This thesis explores a novel and direct method for predicting useful and understandable language context from the visual information within a video. Each video is represented by a scene-weighted sum of deep feature vectors across the video. Then the method predicts a novel cluster-weighted term frequency-inverse document frequency (TF–IDF) vector for a test video by averaging the TF–IDF vectors of the K training videos having the most similar visual features. The predicted vector provides informati...
Linking natural language to visual data is an important topic at the intersection of Natural Languag...
In this paper, we introduce How2, a multimodal collection of instructional videos with English subti...
We propose a novel method for learning visual concepts and their correspondence to the words of a na...
Current multimodal data processing methods use deep learning to combine complementary visual and tex...
The field of computer vision has long strived to extract understanding from images and videos sequen...
A long standing goal of artificial intelligence is to enable machines to perceive the visual world a...
This paper integrates techniques in natural language processing and computer vision to improve recog...
For most people, watching a brief video and describing what happened (in words) is an easy task. For...
University of Technology Sydney. Faculty of Engineering and Information Technology.Video understandi...
Video captioning refers to the process of conveying information of video clips through automatically...
People typically learn through exposure to visual facts associated with linguistic descriptions. For...
Video captioning refers to the task of generating a natural language sentence that explains the cont...
Generating natural language descriptions for visual data links computer vision and computational lin...
Vision to language problems, such as video annotation, or visual question answering, stand out from ...
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine le...
Linking natural language to visual data is an important topic at the intersection of Natural Languag...
In this paper, we introduce How2, a multimodal collection of instructional videos with English subti...
We propose a novel method for learning visual concepts and their correspondence to the words of a na...
Current multimodal data processing methods use deep learning to combine complementary visual and tex...
The field of computer vision has long strived to extract understanding from images and videos sequen...
A long standing goal of artificial intelligence is to enable machines to perceive the visual world a...
This paper integrates techniques in natural language processing and computer vision to improve recog...
For most people, watching a brief video and describing what happened (in words) is an easy task. For...
University of Technology Sydney. Faculty of Engineering and Information Technology.Video understandi...
Video captioning refers to the process of conveying information of video clips through automatically...
People typically learn through exposure to visual facts associated with linguistic descriptions. For...
Video captioning refers to the task of generating a natural language sentence that explains the cont...
Generating natural language descriptions for visual data links computer vision and computational lin...
Vision to language problems, such as video annotation, or visual question answering, stand out from ...
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine le...
Linking natural language to visual data is an important topic at the intersection of Natural Languag...
In this paper, we introduce How2, a multimodal collection of instructional videos with English subti...
We propose a novel method for learning visual concepts and their correspondence to the words of a na...