On Segment-Aware Monocular Depth Estimation Using Vision Transformers

Arampatzakis, Vasileios; Pavlidis, George; Mitianoudis, Nikolaos; Papamarkos, Nikos

doi:10.3390/info17020145

Published February 2, 2026 | Version v1

Journal Open

1. Athena Research and Innovation Center In Information Communication & Knowledge Technologies
2. Democritus University of Thrace

Monocular Depth Estimation (MDE) infers per-pixel scene geometry from a single RGB image. Despite recent progress, global MDE models often blur depth discontinuities at object boundaries and fail to capture object-level structure. Segment-aware depth estimation addresses this limitation by exploiting semantic segmentation to decompose depth prediction into simpler, class-specific subproblems. In this work, we study semantic-aware MDE in a multi-branch design where each semantic class is handled by a lightweight Vision Transformer (ViT) branch that predicts dense depth for its class while suppressing interference from other regions. We further examine fusion strategies that merge the branch outputs into a single prediction: (i) a learnable cross-attention fusion module that predicts depth from the stack of per-class proposals and masks, and (ii) a parameter-free stitched summation that sums mask-gated outputs. The proposed architecture is simple, scalable, end-to-end trainable, and compatible with arbitrary transformer backbones. Experiments on Virtual KITTI 2, where ground-truth depth and semantic labels are available, show that segment-aware modeling produces sharper depth boundaries and improves standard error metrics compared to a single-branch baseline (AbsRel 0.243→0.152; RMSE 11.952→9.101). Finally, we find that the parameter-free summation matches, and in most cases improves upon, the accuracy of learned fusion while adding no computational overhead.
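The parameter-free stitched summation and the two reported metrics can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the per-class masks are binary and partition the image, that each branch outputs a full-resolution depth map, and that AbsRel and RMSE follow their usual definitions over pixels with valid (positive) ground truth. The function names and toy values are hypothetical, and the paper's exact evaluation protocol (for example, any depth cap) is not stated in the abstract.

```python
import numpy as np


def stitch_by_summation(branch_depths, masks):
    """Fuse per-class depth proposals by summing mask-gated outputs.

    branch_depths: array of shape (C, H, W), one depth map per class branch.
    masks: array of shape (C, H, W), binary masks assumed to partition the image.
    """
    branch_depths = np.asarray(branch_depths, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    # Each pixel keeps only the prediction of the branch that owns its class.
    return (branch_depths * masks).sum(axis=0)


def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with positive ground truth."""
    pred, gt = np.asarray(pred, dtype=np.float64), np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def rmse(pred, gt):
    """Root mean squared error over pixels with positive ground truth."""
    pred, gt = np.asarray(pred, dtype=np.float64), np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    return float(np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2)))


# Toy example (made-up values): two classes on a 2x2 image.
mask_a = np.array([[1, 0], [1, 0]])
masks = np.stack([mask_a, 1 - mask_a])
branch_depths = np.stack([np.full((2, 2), 10.0), np.full((2, 2), 20.0)])

fused = stitch_by_summation(branch_depths, masks)
gt = np.array([[10.0, 25.0], [8.0, 20.0]])

print(fused)
print(abs_rel(fused, gt), rmse(fused, gt))
```

Because the gating is a plain multiply-and-sum, this fusion step introduces no trainable parameters, which is consistent with the abstract's description of it as parameter-free.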

Files

Name	Size
On Segment-Aware Monocular Depth Estimation Using Vision Transformers.pdf (md5:3e32e60de85d349afdd191530f899ec0)	4.9 MB

Additional details

Funding: European Commission, ARGUS - Non-destructive, scalable, smart monitoring of remote cultural treasures (grant 101132308)
