Cross-view video understanding is an important yet under-explored area in computer vision. In this paper, we introduce a joint parsing framework that integrates view-centric proposals into scene-centric parse graphs that represent a coherent scene-centric understanding of cross-view scenes. Our key observations are that overlapping fields of views embed rich appearance and geometry correlations and that knowledge fragments corresponding to individual vision tasks are governed by consistency constraints available in commonsense knowledge. The proposed joint parsing framework represents such correlations and constraints explicitly and generates semantic scene-centric parse graphs. Quantitative experiments show that scene-centric predictions i...
We present an approach to jointly learn a set of view-specific dictionaries and a common dictionary ...
Abstract—We propose a multimedia analysis framework to process video and text jointly for understand...
In computer vision, scene parsing is the problem of labelling every pixel in an image or video with ...
The computer vision community has been long focusing on classic tasks such as object detection, huma...
Scene graph parsing aims at understanding an image as a graph where vertices are visual objects (pot...
In this paper, we propose a Spatio-temporal Attributed Parse Graph (ST-APG) to integrate semantic at...
Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed...
Humans can extract rich information from visual scenes, such as the 3D locations of objects and huma...
Abstract—In some real world applications, like information retrieval and data classification, we oft...
Exploiting relationships between objects for image and video captioning has received increasing atte...
International audienceWhen watching the same visual stimulus, humans can exhibit a wide range of gaz...
It is a challenging yet crucial task to have a comprehensive understanding of human activities and e...
Real-world data is often multi-view, with each view representing a different perspective of the data...
In this thesis, we propose novel deep learning algorithms for the vision and language tasks, includi...
We describe a method for obtaining the principal objects, characters and scenes in a video by measur...
We present an approach to jointly learn a set of view-specific dictionaries and a common dictionary ...
Abstract—We propose a multimedia analysis framework to process video and text jointly for understand...
In computer vision, scene parsing is the problem of labelling every pixel in an image or video with ...
The computer vision community has been long focusing on classic tasks such as object detection, huma...
Scene graph parsing aims at understanding an image as a graph where vertices are visual objects (pot...
In this paper, we propose a Spatio-temporal Attributed Parse Graph (ST-APG) to integrate semantic at...
Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed...
Humans can extract rich information from visual scenes, such as the 3D locations of objects and huma...
Abstract—In some real world applications, like information retrieval and data classification, we oft...
Exploiting relationships between objects for image and video captioning has received increasing atte...
International audienceWhen watching the same visual stimulus, humans can exhibit a wide range of gaz...
It is a challenging yet crucial task to have a comprehensive understanding of human activities and e...
Real-world data is often multi-view, with each view representing a different perspective of the data...
In this thesis, we propose novel deep learning algorithms for the vision and language tasks, includi...
We describe a method for obtaining the principal objects, characters and scenes in a video by measur...
We present an approach to jointly learn a set of view-specific dictionaries and a common dictionary ...
Abstract—We propose a multimedia analysis framework to process video and text jointly for understand...
In computer vision, scene parsing is the problem of labelling every pixel in an image or video with ...