The two popular datasets ScanRefer [16] and ReferIt3D [3] connect natural language to real-world 3D data. In this paper, we curate a large-scale dataset that complements and extends both of them by associating all objects mentioned in a referential sentence with their underlying instances inside a 3D scene. Specifically, our Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences between 369k objects across 84k natural referential sentences, covering 705 real-world scenes. Crucially, we show that by incorporating intuitive losses that enable learning from this novel dataset, we can significantly improve the performance of several recently introduced neural listening architectures, including improving the SoTA...
The unprecedented advancements in Large Language Models (LLMs) have created a profound impact on nat...
This paper presents the integration of natural language processing and computer vision to improve th...
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extract...
Recent studies on dense captioning and visual grounding in 3D have achieved impressive results. Desp...
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natur...
Recent progress on 3D scene understanding has explored visual grounding (3DVG) to localize a target ...
Recent advances in 3D semantic segmentation with deep neural networks have shown remarkable success,...
We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is the largest dataset of ...
We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D...
Text-to-scene generation systems take input in the form of a natural language text and output a 3D s...
3D graphics scenes are difficult to create, requiring users to learn and utilize a series of complex...
Many basic indoor activities such as eating or writing are always conducted upon different tabletops...
Learning descriptive 3D features is crucial for understanding 3D scenes with diverse objects and com...
People can describe spatial scenes with language and, vice versa, create images based on linguistic ...
Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences wi...