We present a model that generates natural language de-scriptions of images and their regions. Our approach lever-ages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between lan-guage and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architec-ture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in re-trieval...
Abstract In numerous multimedia and multi-modal tasks from image and video retrieval to zero-shot r...
Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between ima...
Existing image-text matching approaches typically infer the similarity of an image-text pair by capt...
We present a model that generates natural language de-scriptions of images and their regions. Our ap...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Advanced image-based application systems such as image retrieval and visual question answering depen...
Cross-modal retrieval has attracted widespread attention in many cross-media similarity search appli...
Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval ...
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fr...
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel s...
We present a new approach for modeling multi-modal data sets, focusing on the specific case of segme...
We introduce a model for bidirectional retrieval of images and sentences through a multi-modal embed...
We presented a learning model that generated natural language description of images. The model utili...
Recurrent neural networks (RNN) have gained a reputation for producing state-of-the-art results on m...
Abstract In numerous multimedia and multi-modal tasks from image and video retrieval to zero-shot r...
Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between ima...
Existing image-text matching approaches typically infer the similarity of an image-text pair by capt...
We present a model that generates natural language de-scriptions of images and their regions. Our ap...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Advanced image-based application systems such as image retrieval and visual question answering depen...
Cross-modal retrieval has attracted widespread attention in many cross-media similarity search appli...
Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval ...
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fr...
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel s...
We present a new approach for modeling multi-modal data sets, focusing on the specific case of segme...
We introduce a model for bidirectional retrieval of images and sentences through a multi-modal embed...
We presented a learning model that generated natural language description of images. The model utili...
Recurrent neural networks (RNN) have gained a reputation for producing state-of-the-art results on m...
Abstract In numerous multimedia and multi-modal tasks from image and video retrieval to zero-shot r...
Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between ima...
Existing image-text matching approaches typically infer the similarity of an image-text pair by capt...