Published October 27, 2021 | Version v1
Conference paper | Open Access

Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

  • 1. Harbin Institute of Technology, China
  • 2. ETH Zurich, Switzerland
  • 3. University of Trento, Italy

Description

While convolutional neural networks have shown a tremendous impact on various computer vision tasks, they generally demonstrate limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation. Initially designed for natural language processing tasks, Transformers have emerged as alternative architectures with innate global self-attention mechanisms to capture long-range dependencies. In this paper, we propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To avoid the network losing its ability to capture local-level details due to the adoption of transformers, we propose a novel decoder that employs attention mechanisms based on gates. Notably, this is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth prediction and surface normal estimation). Extensive experiments demonstrate that the proposed TransDepth achieves state-of-the-art performance on three challenging datasets. Our code is available at: https://github.com/ygjwd12345/TransDepth.
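
To make the gate-based attention decoder idea above concrete, below is a minimal PyTorch sketch of one possible decoder block: a coarse transformer feature map gates a high-resolution CNN skip connection via additive attention. The class and parameter names (AttentionGate, theta, phi, psi) and the tensor shapes are illustrative assumptions, not the authors' implementation; the actual TransDepth decoder is available in the linked repository.

```python
import torch
import torch.nn as nn


class AttentionGate(nn.Module):
    """Hypothetical gate-based attention block: rescales a CNN skip
    connection (local detail) using a coarser gating signal (global
    context, e.g. transformer features)."""

    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.theta = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)  # project skip
        self.phi = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)    # project gate
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)          # scalar weight

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # Upsample the coarse gating signal to the skip connection's resolution.
        gate = nn.functional.interpolate(
            gate, size=skip.shape[2:], mode="bilinear", align_corners=False)
        # Additive attention: a sigmoid over the 1x1-projected sum yields a
        # per-pixel weight in (0, 1) that rescales the skip features.
        attn = torch.sigmoid(self.psi(torch.relu(self.theta(skip) + self.phi(gate))))
        return skip * attn


# Usage: gate high-resolution CNN features with low-resolution
# transformer features before fusing them in the decoder.
skip = torch.randn(1, 64, 120, 160)   # CNN features (local detail)
gate = torch.randn(1, 256, 15, 20)    # transformer features (global context)
gated = AttentionGate(skip_ch=64, gate_ch=256, inter_ch=32)(skip, gate)
print(gated.shape)  # torch.Size([1, 64, 120, 160])
```

The design intent mirrored here is the one stated in the abstract: the gate lets global transformer context decide, per pixel, how much local CNN detail passes into the decoder, so adopting transformers does not wash out fine structure.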

Files

Yang_Transformer-Based_Attention_Networks_for_Continuous_Pixel-Wise_Prediction_ICCV_2021_paper.pdf

Additional details

Funding

AI4Media – A European Excellence Centre for Media, Society and Democracy (Grant No. 951911)
European Commission