On Segment-Aware Monocular Depth Estimation Using Vision Transformers
Authors/Creators
Description
Monocular Depth Estimation (MDE) infers per-pixel scene geometry from a single RGB image. Despite recent progress, global MDE models often blur depth discontinuities at object boundaries and fail to capture object-level structure. Segment-aware depth estimation addresses this limitation by exploiting semantic segmentation to decompose depth prediction into simpler, class-specific subproblems. In this work, we study semantic-aware MDE in a multi-branch design where each semantic class is handled by a lightweight Vision Transformer (ViT) branch that predicts dense depth for its class while suppressing interference from other regions. We further examine fusion strategies that merge the branch outputs into a single prediction: (i) a learnable cross-attention fusion module that predicts depth from the stack of per-class proposals and masks, and (ii) a parameter-free stitched summation that sums mask-gated outputs. The proposed architecture is simple, scalable, end-to-end trainable, and compatible with arbitrary transformer backbones. Experiments on Virtual KITTI 2, where ground-truth depth and semantic labels are available, show that segment-aware modeling produces sharper depth boundaries and improves standard error metrics compared to a single-branch baseline (AbsRel 0.243→0.152; RMSE 11.952→9.101). Finally, we find that the parameter-free summation matches, and in most cases improves upon, the accuracy of learned fusion while adding no computational overhead.
Files
On Segment-Aware Monocular Depth Estimation Using Vision Transformers.pdf
Files
(4.9 MB)
| Name | Size | Download all |
|---|---|---|
|
md5:3e32e60de85d349afdd191530f899ec0
|
4.9 MB | Preview Download |