Breaking the 2D Dependency: What Limits 3D-Only Open-Vocabulary Scene Understanding

D'Orsi, Domenico; Carrara, Fabio; Falchi, Fabrizio; Tonellotto, Nicola

doi:10.5281/zenodo.17338755

Published October 13, 2025 | Version v1

Conference paper Open

1. University of Pisa
2. Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo", Consiglio Nazionale delle Ricerche
3. National Research Council
4. University of Glasgow
5. Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo"

Accepted at CBMI 2025. Post-print version.

Open-vocabulary 3D scene understanding, i.e., recognizing and classifying objects in 3D scenes without being limited to a predefined set of classes, is a foundational task for robotics and extended reality applications.
Current leading methods often rely on 2D foundation models to extract semantics, which are then projected into 3D.
This paper investigates the viability of a purely 3D-native pipeline, thereby eliminating dependencies on 2D models and reprojections.
We systematically explore various architectural combinations built from established 3D components.
However, our extensive experiments on benchmark datasets reveal that this direct 3D-native approach performs significantly below expectations.
Rather than a simple failure, these outcomes provide critical insights into the current deficiencies of existing 3D models when cascaded for complex open-vocabulary tasks.
We highlight the lessons learned, identify the pipeline's limitations (e.g., segmenter-encoder domain gap, robustness to imperfect segmentations), and posit future research directions.
We argue that a fundamental rethinking of model design and interplay is necessary to realize the potential of truly 3D-native open-vocabulary understanding.

Files

Name	Size
2025_CBMI___3D_Only_OVSU.pdf (md5:32047d466f8f23810dd95ecfa570b706)	1.1 MB

Additional details

Funding

European Commission
SUN - Social and hUman ceNtered XR, grant 101092612

