Exploiting relationships between objects for image and video captioning has received increasing attention. Most existing methods depend heavily on pre-trained detectors of objects and their relationships, and thus may not work well when facing detection challenges such as heavy occlusion, tiny-size objects, and long-tail classes. In this paper, we propose a joint commonsense and relation reasoning method that exploits prior knowledge for image and video captioning without relying on any detectors. The prior knowledge provides semantic correlations and constraints between objects, serving as guidance to build semantic graphs that summarize object relationships, some of which cannot be directly perceived from images or videos. Particularly, o...
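The abstract above describes semantic graphs whose edges come from prior knowledge rather than from a pre-trained relationship detector. A minimal sketch of that idea is given below, assuming a hypothetical prior-knowledge relation table (PRIOR_RELATIONS) and helper (build_semantic_graph); these names are illustrative only and do not reflect the paper's actual implementation.

```python
# Minimal sketch (illustrative only): build a semantic graph whose edges are
# drawn from a prior-knowledge relation table instead of a pre-trained
# relationship detector. PRIOR_RELATIONS and build_semantic_graph are
# hypothetical names, not taken from the paper.

from itertools import combinations

# Hypothetical prior knowledge: semantic correlations between object concepts,
# e.g. mined from external corpora or a commonsense knowledge base.
PRIOR_RELATIONS = {
    ("person", "horse"): "riding",
    ("person", "umbrella"): "holding",
    ("dog", "frisbee"): "chasing",
}

def build_semantic_graph(concepts):
    """Link every pair of concepts that the prior knowledge marks as related.

    Returns (nodes, edges); edges carry the prior relation label, so relations
    that are hard to perceive visually (heavy occlusion, tiny objects) can
    still appear in the graph.
    """
    nodes = list(dict.fromkeys(concepts))  # deduplicate, keep order
    edges = []
    for a, b in combinations(nodes, 2):
        relation = PRIOR_RELATIONS.get((a, b)) or PRIOR_RELATIONS.get((b, a))
        if relation is not None:
            edges.append((a, relation, b))
    return nodes, edges

# Example: concepts predicted from an image or video, no relationship detector.
print(build_semantic_graph(["person", "horse", "tree"]))
# -> (['person', 'horse', 'tree'], [('person', 'riding', 'horse')])
```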
Scene graph generation has received growing attention with the advancements in image understanding t...
Video captioning has become a broad and interesting research area. Attention-based encoder-decoder m...
Neural networks have been shown effective at learning rich low-dimensional representations of high-d...
Visual relationship detection aims to completely understand visual scenes and has recently received ...
Object detection, visual relationship detection, and image captioning, which are the three main visu...
Image captioning has been shown to achieve better performance by using scene graphs to repres...
Scene graph parsing aims at understanding an image as a graph where vertices are visual objects (pot...
The computer vision community has long been focusing on classic tasks such as object detection, huma...
How to select relevant key objects and reason about the complex relationships across vision and lingu...
Semantic Image Interpretation is the task of extracting a structured semantic description from image...
Visual relationship detection is fundamental for holistic image understanding. However, the localiza...
Image captioning is the process of analyzing an image and generating a textual description according...
Cross-view video understanding is an important yet under-explored area in computer vision. In this p...
This paper introduces a novel approach for modeling visual relations between p...