Recently, joint video-language modeling has been attracting more and more attention. However, most existing approaches focus on exploring the language model on top of a fixed visual model. In this paper, we propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model, and a joint embedding model. In our language model, we propose a dependency-tree structure model that embeds a sentence into a continuous vector space, which preserves visually grounded meanings and word order. In the visual model, we leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, w...
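The joint embedding idea described above can be illustrated with a minimal sketch: project video and sentence features into a shared space and train them with a margin ranking loss so that matching pairs score higher than mismatched ones. Everything here (the dimensions, the random linear projections standing in for the deep video model and the dependency-tree language model, and the function names) is a hypothetical illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: video features, text features, joint space.
D_VIDEO, D_TEXT, D_JOINT = 8, 6, 4

# Random linear maps stand in for the learned video and language encoders.
W_v = rng.normal(size=(D_JOINT, D_VIDEO))
W_t = rng.normal(size=(D_JOINT, D_TEXT))

def embed(x, W):
    """Project a feature vector into the joint space and L2-normalize it,
    so the dot product below is a cosine similarity."""
    z = W @ x
    return z / np.linalg.norm(z)

def ranking_loss(video, pos_text, neg_text, margin=0.2):
    """Margin ranking loss: the matching sentence should score at least
    `margin` higher than a mismatched one."""
    v = embed(video, W_v)
    tp = embed(pos_text, W_t)
    tn = embed(neg_text, W_t)
    return max(0.0, margin - v @ tp + v @ tn)

video = rng.normal(size=D_VIDEO)
pos = rng.normal(size=D_TEXT)
neg = rng.normal(size=D_TEXT)
loss = ranking_loss(video, pos, neg)
print(f"ranking loss: {loss:.4f}")
```

In practice the projections would be trained by minimizing this loss over many (video, matching sentence, mismatched sentence) triplets, which is the standard recipe for the joint embedding models these abstracts describe.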
We address the problem of text-based activity retrieval in video. Given a sentence describing an act...
This paper strives to find amidst a set of sentences the one best describing the content of a given ...
Humans can easily describe what they see in a coherent way and at varying level of detail. However, ...
Linking natural language to visual data is an important topic at the intersection of Natural Languag...
This paper integrates techniques in natural language processing and computer vision to improve recog...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...
Humans use rich natural language to describe and communicate visual perceptions. In order to provid...
Generating natural language descriptions for visual data links computer vision and computational lin...
Abstract Automatically describing video content with natural language is a fundamental challenge of ...
In this thesis, we propose novel deep learning algorithms for the vision and language tasks, includi...
Video captioning refers to the task of generating a natural language sentence that explains the cont...
Vision to language problems, such as video annotation, or visual question answering, stand out from ...
Modeling visual context and its corresponding text description with a joint embedding network has be...