OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning

(Accepted to CVPR 2024)


Haiyang Ying, Yixuan Yin, Jinzhi Zhang, Fan Wang, Tao Yu, Ruqi Huang, Lu Fang

Tsinghua University

ArXiv Paper
Code (available now!)

Overview



We propose an omniversal 3D segmentation method that (a) takes multi-view, inconsistent, class-agnostic 2D segmentations as input and outputs a consistent 3D feature field via a hierarchical contrastive learning framework. The method supports (b) hierarchical segmentation, (c) multi-object selection, and (d) holistic discretization in an interactive manner. The feature map is visualized by applying PCA to the original N-D features, so similar colors indicate high feature similarity. The score map shows similarity with respect to a chosen point: dark red means high similarity and dark blue means low similarity.
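
The two visualizations described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, the feature dimension is assumed to be at least 3, and cosine similarity is an assumption, since the text only says "similarity".

```python
import numpy as np

def pca_to_rgb(features):
    """Project an (H, W, D) feature map to (H, W, 3) colors in [0, 1] via PCA."""
    h, w, d = features.shape
    flat = features.reshape(-1, d)
    centered = flat - flat.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T
    # Rescale each of the three components independently to [0, 1].
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-8)
    return rgb.reshape(h, w, 3)

def score_map(features, anchor_yx):
    """Cosine similarity (assumed metric) of every pixel to the pixel at anchor_yx."""
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)
    anchor = unit[anchor_yx[0], anchor_yx[1]]
    return unit @ anchor  # (H, W), values in [-1, 1]
```

The score map can then be shown with any diverging red-blue colormap so that high values appear red and low values appear blue.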


Global Discretization Performance

With the globally consistent 3D feature field optimized by OmniSeg3D, we can segment anything without restrictions on the number or categories of objects. In this demo, challenging parts of Tengwang Pavilion, a work of traditional Chinese architecture, are segmented conveniently in an interactive manner.


Representation


We propose a novel hierarchical representation based on the masks given by click-based 2D segmentation methods. (a) For each image, click-based 2D segmentors provide a set of 2D binary masks. (b) Directly overlapping the masks, as conventional methods (like SAM) do, loses hierarchical information. (c) In contrast, our patch-based modeling effectively preserves the hierarchical relationship between pixels. The hierarchical representation of each image includes a patch index map and a correlation matrix, where the relevance between the anchor patch and other patches is evaluated via a voting strategy.
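
As a rough sketch of this representation, the snippet below builds a patch index map and a vote matrix from a stack of overlapping binary masks. It rests on two assumptions not stated in the text above: a patch is taken to be the set of pixels covered by exactly the same subset of masks, and each mask that contains both patches casts one vote for that pair. The function name `build_hierarchy` is hypothetical.

```python
import numpy as np

def build_hierarchy(masks):
    """masks: (M, H, W) boolean array of possibly overlapping 2D masks.

    Returns:
      patch_index: (H, W) int map assigning each pixel to a patch.
      votes: (P, P) int matrix; votes[i, j] counts masks containing both patches.
    """
    m, h, w = masks.shape
    # One row per pixel, one column per mask: the pixel's membership signature.
    membership = masks.reshape(m, -1).T.astype(np.uint8)
    # Pixels with identical signatures form one patch.
    signatures, inverse = np.unique(membership, axis=0, return_inverse=True)
    patch_index = inverse.reshape(h, w)
    sig = signatures.astype(np.int64)
    # Dot product of two signatures = number of masks shared by the two patches.
    votes = sig @ sig.T
    return patch_index, votes
```

Under these assumptions, two patches inside the same small part share more masks (more votes) than two patches that only share the whole-object mask, and patches from unrelated objects share none, which is how the overlap hierarchy survives in the matrix.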


Framework


We propose a hierarchical contrastive learning framework in 3D space for feature field optimization. (a) For each input RGB image, we apply (b) 2D hierarchical modeling to get a patch index map and a correlation matrix. During training, we use (c) a NeRF-based (or mesh-based) rendering pipeline to render features from 3D space and apply (d) hierarchical contrastive learning to the rendered features to optimize the feature field for segmentation.
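
To make the idea of step (d) concrete, here is a simplified, vote-weighted contrastive loss over a batch of rendered features. It is an illustrative stand-in and not the loss defined in the paper: the weighting scheme, the temperature value, and the function name are all assumptions, and the code only evaluates the loss in NumPy (actual training would need an autodiff framework). It expects at least two samples.

```python
import numpy as np

def hierarchical_contrastive_loss(feats, patch_ids, votes, temperature=0.1):
    """feats: (N, D) rendered features; patch_ids: (N,) patch index per sample;
    votes: (P, P) shared-mask counts between patches (hypothetical input format)."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.maximum(norms, 1e-8)
    n = feats.shape[0]
    logits = feats @ feats.T / temperature
    not_self = ~np.eye(n, dtype=bool)
    # Stable log-sum-exp over all other samples (self-similarity excluded).
    masked = np.where(not_self, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom
    # Pairs that share more masks pull together more strongly; zero-vote pairs
    # act only as negatives through the denominator.
    weights = votes[patch_ids][:, patch_ids].astype(float) * not_self
    weight_sum = weights.sum(axis=1)
    valid = weight_sum > 0  # anchors with at least one positive
    if not valid.any():
        return 0.0
    per_anchor = -(weights * log_prob).sum(axis=1)[valid] / weight_sum[valid]
    return float(per_anchor.mean())
```

The loss is small when samples from strongly related patches have aligned features and large when they do not, which is the behavior a hierarchy-aware objective needs.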



Interactive Segmentation


We show that our method is capable of interactive 3D segmentation in real time, including hierarchical inference and multi-object selection.
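
One simple way such interaction could work on top of a feature field is sketched below. It assumes, without confirmation from the text, that selection is done by thresholding cosine similarity to the clicked points: lowering the threshold moves from a part to the whole object, and several clicks are combined as a union. The function name `select` is hypothetical.

```python
import numpy as np

def select(features, clicks, threshold):
    """features: (N, D) per-point features; clicks: list of clicked point indices.

    Returns a boolean mask of points whose cosine similarity to at least one
    clicked point is >= threshold.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)
    scores = unit @ unit[clicks].T  # (N, K): similarity to each click
    return (scores >= threshold).any(axis=1)

# Toy example: points 0 and 1 are nearly identical, point 2 is loosely related,
# point 3 is unrelated.
features = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.6, 0.8, 0.0],
                     [0.0, 0.0, 1.0]])
fine = select(features, [0], threshold=0.9)      # selects points 0 and 1
coarse = select(features, [0], threshold=0.5)    # also selects point 2
multi = select(features, [0, 3], threshold=0.9)  # union of two clicks
```

Because the operation is a single matrix product followed by a comparison, it is cheap enough to recompute on every click or threshold change.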

Interactive Hierarchical Inference - Replica Room-0



Interactive Multi-object Selection - Tengwang Pavilion



Interactive Multi-object Selection - Replica Room-0




Automatic Discretization


After optimization, the features in the 3D field can be distilled onto a mesh for automatic global segmentation without any human intervention, as shown below. However, because hierarchy levels lack a clear definition, simply clustering features does not guarantee that all objects are segmented at the same level. Text-aligned hierarchical segmentation may be a future direction for addressing this issue.
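
The level ambiguity can be seen even in a toy clustering. The sketch below uses a greedy cosine-similarity clustering chosen purely for illustration; the text does not say which clustering algorithm is used, and the function name and threshold values are assumptions.

```python
import numpy as np

def greedy_cluster(features, threshold=0.9):
    """Assign each (N, D) feature to the first seed it is similar enough to."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)
    labels = np.full(len(unit), -1)
    next_label = 0
    for seed in range(len(unit)):
        if labels[seed] != -1:
            continue  # already absorbed by an earlier cluster
        close = (unit @ unit[seed] >= threshold) & (labels == -1)
        close[seed] = True  # the seed always joins its own cluster
        labels[close] = next_label
        next_label += 1
    return labels

features = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.6, 0.8, 0.0],
                     [0.0, 0.0, 1.0]])
strict = greedy_cluster(features, threshold=0.9)  # three segments
loose = greedy_cluster(features, threshold=0.5)   # two segments
```

The same features yield three segments at one threshold and two at another, and no single threshold is guaranteed to correspond to the same hierarchy level for every object in a scene.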



Conclusion


In this paper, we propose OmniSeg3D, an omniversal segmentation method that facilitates holistic understanding of 3D scenes. Leveraging a hierarchical representation and a hierarchical contrastive learning framework, OmniSeg3D effectively transforms inconsistent 2D segmentations into a globally consistent 3D feature field while retaining hierarchical information, which enables correct hierarchical 3D sensing and high-quality object segmentation. In addition, it supports various interactive functionalities, including hierarchical inference, multi-object selection, and global discretization, which may further enable downstream applications in 3D data annotation, robotics, and virtual reality.