Recent studies on dense captioning and visual grounding in 3D have achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Also, how to discriminatively describe objects in complex 3D environments is not fully studied yet. To address these challenges, we present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate. Our D3Net unifies dense captioning and visual grounding in 3D in a self-critical manner. This self-critical property of D3Net also introduces discriminability during object caption generation and enables semi-supervised training on ScanN...
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extract...
Human perception reliably identifies movable and immovable parts of 3D scenes, and completes the 3D ...
Abstract Dense video captioning (DVC) detects multiple events in an input video and generates natura...
Recent advances in 3D semantic segmentation with deep neural networks have shown remarkable success,...
The two popular datasets ScanRefer [16] and ReferIt3D [3] connect natural language to real-world 3D ...
Localizing objects in 3D scenes according to the semantics of a given natural language is a fundamen...
Recent progress on 3D scene understanding has explored visual grounding (3DVG) to localize a target ...
Training 3D object detectors on publicly available data has been limited to small datasets due to t...
The task of object segmentation in videos is usually accomplished by processing appearance and motio...
Existing language grounding models often use object proposal bottlenecks: a pre-trained detector pro...
Captioning images is a challenging scene-understanding task that connects computer vision and natura...
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natur...
3D visual grounding aims to find the object within point clouds mentioned by free-form natural langu...
Recent progress in 3D scene understanding enables scalable learning of representations across large ...
Automatic image captioning, a highly challenging research problem, aims to understand and describe t...
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extract...
Human perception reliably identifies movable and immovable parts of 3D scenes, and completes the 3D ...
Abstract Dense video captioning (DVC) detects multiple events in an input video and generates natura...
Recent advances in 3D semantic segmentation with deep neural networks have shown remarkable success,...
The two popular datasets ScanRefer [16] and ReferIt3D [3] connect natural language to real-world 3D ...
Localizing objects in 3D scenes according to the semantics of a given natural language is a fundamen...
Recent progress on 3D scene understanding has explored visual grounding (3DVG) to localize a target ...
Training 3D object detectors on publicly available data has been limited to small datasets due to t...
The task of object segmentation in videos is usually accomplished by processing appearance and motio...
Existing language grounding models often use object proposal bottlenecks: a pre-trained detector pro...
Captioning images is a challenging scene-understanding task that connects computer vision and natura...
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natur...
3D visual grounding aims to find the object within point clouds mentioned by free-form natural langu...
Recent progress in 3D scene understanding enables scalable learning of representations across large ...
Automatic image captioning, a highly challenging research problem, aims to understand and describe t...
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extract...
Human perception reliably identifies movable and immovable parts of 3D scenes, and completes the 3D ...
Abstract Dense video captioning (DVC) detects multiple events in an input video and generates natura...