Efficient Training of Visual Transformers with Small Datasets
Authors/Creators
- 1. University of Trento, Italy
- 2. Tencent AI Lab
- 3. FBK
Description
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to
convolutional networks (CNNs). Unlike CNNs, VTs can capture global
relations between image elements and they potentially have a larger representation
capacity. However, the lack of the typical convolutional inductive bias makes these
models more data hungry than common CNNs. In fact, some local properties of the
visual domain that are embedded in the CNN architectural design must instead
be learned from samples by VTs. In this paper, we empirically analyse different VTs,
comparing their robustness in a small training set regime, and we show that, despite
having a comparable accuracy when trained on ImageNet, their performance on
smaller datasets can differ substantially. Moreover, we propose an auxiliary self-supervised
task which can extract additional information from images with only a
negligible computational overhead. This task encourages the VTs to learn spatial
relations within an image and makes the VT training much more robust when
training data is scarce. Our task is used jointly with the standard (supervised)
training and it does not depend on specific architectural choices, thus it can be
easily plugged into existing VTs. Using an extensive evaluation with different
VTs and datasets, we show that our method can improve (sometimes dramatically)
the final accuracy of the VTs. Our code is available at:
https://github.com/yhlleo/VTs-Drloc.
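
The joint training described above can be illustrated with a minimal sketch. The description only states that the auxiliary task encourages learning spatial relations within an image and is added to the standard supervised training. The concrete form used here (regressing the normalized 2D offset between randomly sampled pairs of token positions on a k x k grid, with an L1 loss) is an assumption made for illustration. All function names, the weight `lam`, and the placeholder predictions are hypothetical and are not taken from the authors' code.

```python
import random


def sample_token_pairs(k, m, rng):
    """Draw m random pairs of positions on a k x k grid of token embeddings."""
    return [((rng.randrange(k), rng.randrange(k)),
             (rng.randrange(k), rng.randrange(k))) for _ in range(m)]


def relative_offsets(pairs, k):
    """Regression targets: the 2D offset between the two positions, scaled by k."""
    return [((i - p) / k, (j - h) / k) for (i, j), (p, h) in pairs]


def l1_loss(predictions, targets):
    """Sum of absolute coordinate errors, averaged over the number of pairs."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(predictions, targets))
    return total / len(targets)


def joint_loss(supervised_loss, auxiliary_loss, lam=0.1):
    """Supervised loss plus the weighted auxiliary loss (lam is an assumed value)."""
    return supervised_loss + lam * auxiliary_loss


rng = random.Random(0)
pairs = sample_token_pairs(k=7, m=32, rng=rng)
targets = relative_offsets(pairs, k=7)

# Placeholder: a real model would predict each offset from the two token
# embeddings (for example with a small MLP); here every prediction is zero.
predictions = [(0.0, 0.0)] * len(targets)

aux = l1_loss(predictions, targets)
total = joint_loss(supervised_loss=1.25, auxiliary_loss=aux)
```

Because the targets are derived from token positions alone, a task of this kind needs no extra labels and touches only the final token grid, which is consistent with the claim that it does not depend on specific architectural choices.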
Files
| Name | Size | Checksum |
|---|---|---|
| NeurIPS-2021-efficient-training-of-visual-transformers-with-small-datasets-Paper (3).pdf | 4.9 MB | md5:411e5a67490ed2aa41b4a00226d6c641 |