Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between images and natural language. Experimental results on public vision and language datasets show that all of these methods obtain significant performance improvements on vision and language tasks such as image captioning and cross-modal retrieval. We, as humans, can easily use our vision and language capabilities to accomplish a wide variety of tasks that combine the image and text modalities. However, this is not easy for machines, because it requires a model to understand both the image and the language, and especially how they relate to each other. In recent years, considerable progress has been made in applying deep learning to computer vision and...
With the advancement in deep learning, the consolidation of text classification and image processing...
Tasks that require modeling of both language and visual information such as image captioning have be...
Understanding visual media, i.e. images and videos, has been a cornerstone topic in computer vision ...
Developing intelligent agents that can perceive and understand the rich visual world around us has b...
This thesis focuses on proposing and addressing various tasks in the field of vision and language, a...
Powered by deep convolutional networks and large scale visual datasets, modern computer vision syste...
Each time we ask for an object, describe a scene, follow directions or read a document containi...
This paper presents the integration of natural language processing and computer vision to improve th...
We introduce two multimodal neural language models: models of natural language that can be conditio...
Computer vision is moving from predicting discrete, categorical labels to generating rich descriptio...
Automatically describing the content of an image is a fundamental problem in artificial intelligence...
A long-standing goal of artificial intelligence is to enable machines to perceive the visual world a...