Visual attention mechanisms are widely used in image captioning models to dynamically attend to the relevant visual regions given the language context. This capability allows a trained model to perform fine-grained image understanding and reasoning. However, existing visual attention models focus only on individual visual regions and on the alignment between the language representation and those individual regions. They do not fully explore the relationships and interactions between visual regions. Nor do they analyze or explore the alignment for related words or phrases (e.g., verbs or phrasal verbs), which may best describe the relationships and interactions between these visual regions. Thus, i...
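As a rough illustration of the region-level attention the abstract describes, the sketch below scores each visual region against a language query vector and forms a weighted context vector. It is a generic dot-product soft attention, not the model proposed in the paper; the function name `soft_attention`, the toy feature vectors, and the choice of dot-product scoring are all assumptions made for illustration.

```python
import math

def soft_attention(region_feats, query):
    """Generic dot-product soft attention over region features (illustrative only)."""
    # Score each region by its dot product with the language query vector.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_feats]
    # Softmax over the scores, subtracting the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the region features.
    dim = len(region_feats[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical toy data: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = soft_attention(regions, query)
```

Note that each region is weighted independently of the others here, which is the limitation the abstract points to: nothing in this computation models relationships or interactions between regions.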
Image captioning is the task of automatically generating a description of an image. Traditional imag...
Automatic image caption prediction is a challenging task in natural language processing. Most of the...
Image captioning aims to describe an image in natural language as a human would, and it has benefited fro...
University of Technology Sydney. Faculty of Engineering and Information Technology. Scene understandi...
We propose ``Areas of Attention'', a novel attention-based model for automatic...
openaire: EC/H2020/780069/EU//MeMAD. Dense captioning (DC), which provides a comprehensive context u...
This paper appeared in the AAAI-98 Workshop on Representations for Multi-Modal Human-Computer Inter...
An image can be considered as a collection of small regions. Most research on image understanding ...
Image captioning and visual language grounding are two important tasks for image understanding, but ...
Object detection, visual relationship detection, and image captioning, which are the three main visu...
Given an unstructured collection of captioned images of cluttered scenes featuring a variety of obje...
Visual attention plays an important role in understanding images and demonstrates its effectiveness in ...
Advanced image-based application systems such as image retrieval and visual question answering depen...
This paper describes a set of methods to link entities across images and text. As a corpus, we used ...