Vision-Language Pretraining for Variable-shot Image Classification
Description
Contrastively pretrained vision-language models (VLMs) such as CLIP have shown impressive zero-shot classification performance without any classification-specific training. They create a common embedding space by contrastively pretraining an image encoder and a text encoder to align positive image-text pairs and repel negative pairs. Zero-shot classification of an image can then be performed by measuring the cosine similarities between the image embedding and the embeddings of texts that describe the classes. However, related work does not address the scenario in which a few image examples are available for some (but not all) classes. In this novel task, which we term variable-shot (v-shot) classification, these models fail due to the modality gap of the embedding space, i.e., the fact that image-to-image similarities are higher than image-to-text ones. To address this, we propose to enable v-shot capabilities in pretrained VLMs with minimal training complexity by re-projecting the embeddings of frozen pretrained image encoders using a shallow network, RectNet, which we train with both the standard CLIP contrastive loss function and a novel modality alignment loss function specifically constructed to bridge the modality gap. Finally, we introduce three v-shot classification benchmarks, on which the proposed architecture achieves increases of 32.22%, 29.58% and 45.15% in top-1 classification accuracy, respectively.
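The two standard ingredients named above, cosine-similarity classification and the CLIP contrastive loss, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy 3-dimensional embeddings and the temperature of 0.07 are assumptions chosen for illustration, and neither RectNet nor the modality alignment loss is reproduced because the description does not specify them. The last part shows only the naive v-shot setup, in which the prototype of one class is the mean of a few image embeddings while the remaining classes keep their text embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def classify(image_emb, class_embs):
    """Return the best-matching class index and the cosine similarity to every class."""
    sims = l2_normalize(class_embs) @ l2_normalize(image_emb)
    return int(np.argmax(sims)), sims

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Row i of image_embs and row i of text_embs form the positive pair;
    every other combination in the batch acts as a negative.
    The temperature value is an assumption, not taken from the record.
    """
    logits = l2_normalize(image_embs) @ l2_normalize(text_embs).T / temperature

    def diagonal_cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Hypothetical toy embeddings: one text embedding per class.
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
query = np.array([0.1, 0.9, 0.2])  # hypothetical image embedding

# Zero-shot: every class is represented by its text embedding.
pred, sims = classify(query, text_embs)

# Naive v-shot: class 0 has a few image examples, so its prototype becomes
# the mean of their embeddings; classes 1 and 2 keep their text embeddings.
shots_class0 = np.array([[0.9, 0.1, 0.0],
                         [0.8, 0.2, 0.1]])
mixed_prototypes = text_embs.copy()
mixed_prototypes[0] = shots_class0.mean(axis=0)
pred_vshot, sims_vshot = classify(query, mixed_prototypes)
```

With real encoders, the mixed prototype set is where the modality gap matters: an image-derived prototype tends to score higher against any query image than a text-derived one, which biases the argmax toward the classes that have image examples.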
Files
Name | Size | Checksum |
---|---|---|
Vision_Language_Pretraining_for_Variable_shot_Image_Classification__ZENODO_.pdf | 734.0 kB | md5:eca6c58bdd5b7aa22564dcdcd8b0ba3b |