We introduce Point-Bind, a 3D multi-modality model that aligns point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and the other modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions. Using parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA; it requires no 3D instruction data, yet exhibits superior 3D and multi-modal question-answering capacity. We hope our work may cast a light on the community for e...
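To make the idea of 3D embedding arithmetic in a joint embedding space more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that encoders for each modality already map inputs into one shared vector space; the three-dimensional toy vectors, the gallery names, and the helper functions `normalize`, `cosine`, `compose`, and `retrieve` are all hypothetical and chosen only for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def compose(*embeddings):
    """Embedding arithmetic: sum unit-length embeddings, then re-normalize."""
    unit = [normalize(e) for e in embeddings]
    summed = [sum(column) for column in zip(*unit)]
    return normalize(summed)

def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))

# Toy stand-ins for encoder outputs (assumed, not real model embeddings).
point_cloud_car = [1.0, 0.0, 0.0]   # hypothetical 3D embedding of a car
audio_waves = [0.0, 1.0, 0.0]       # hypothetical audio embedding of sea waves

# Hypothetical image embeddings living in the same shared space.
gallery = {
    "car_on_road": [1.0, 0.1, 0.0],
    "car_by_sea": [0.7, 0.7, 0.0],
    "empty_beach": [0.1, 1.0, 0.0],
}

query = compose(point_cloud_car, audio_waves)
best = retrieve(query, gallery)
print(best)
```

With these toy vectors the composed query points halfway between the two inputs, so the image that mixes both concepts ranks highest. Real embeddings would come from trained encoders and have far more dimensions.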
Contrastive Language-Image Pre-training (CLIP) has shown promising open-world performance on 2D imag...
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natur...
3D visual grounding aims to find the object within point clouds mentioned by free-form natural langu...
The unprecedented advancements in Large Language Models (LLMs) have created a profound impact on nat...
The understanding capabilities of current state-of-the-art 3D models are limited by datasets with a ...
Although recent point cloud analysis achieves impressive progress, the paradigm of representation le...
Some self-supervised cross-modal learning approaches have recently demonstrated the potential of ima...
The majority of point cloud registration methods currently rely on extracting features from points. ...
Multimodal registration is a challenging problem in visual computing, commonly faced during medical ...
A promising direction for pre-training 3D point clouds is to leverage the massive amount of data in ...
Pre-training across 3D vision and language remains under development because of limited training dat...
In this paper we explore the recent topic of point cloud completion, guided by an auxiliary image. W...
We present a novel, end-to-end learnable, multiview 3D point cloud registration algorithm. Registrat...
The past few years have witnessed the great success and prevalence of self-supervised representation...
When creating 3D content, highly specialized skills are generally needed to design and generate mode...