Vision-Language Pretraining for Variable-shot Image Classification

Papadopoulos, Sotirios; Ioannidis, Konstantinos; Vrochidis, Stefanos; Kompatsiaris, Ioannis (Yiannis); Patras, Ioannis

doi:10.5281/zenodo.14024546

Published January 2025 | Version v1

Conference paper Open

1. Centre for Research and Technology Hellas
2. Queen Mary University of London
3. Centre for Research and Technology-Hellas

Contrastively pretrained vision-language models (VLMs) such as CLIP have shown impressive zero-shot classification performance without any classification-specific training. They create a common embedding space by contrastively pretraining an image encoder and a text encoder to align positive image-text pairs and repel negative pairs. Zero-shot classification of an image can then be performed by measuring the cosine similarities between the image embedding and the embeddings of texts that describe the classes. However, existing work does not address the scenario in which a few image examples are available for some (but not all) classes. In this novel task, which we term variable-shot (v-shot) classification, these models fail due to the modality gap of the embedding space, i.e., the fact that image-to-image similarities are higher than image-to-text ones. To address this, we propose to enable v-shot capabilities in pretrained VLMs with minimal training complexity by re-projecting the embeddings of frozen pretrained image encoders using a shallow network, RectNet, which we train with both the standard CLIP contrastive loss function and a novel modality alignment loss function specifically constructed to bridge the modality gap. Finally, we introduce three v-shot classification benchmarks, on which the proposed architecture achieves 32.22%, 29.58% and 45.15% increases in top-1 classification accuracy, respectively.
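The failure mode described in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not implement RectNet. The 3-d embeddings are made-up toy values in which the third axis stands in for a modality direction (positive for images, negative for texts), and representing a class by the mean of its image shots is an assumption made here for illustration; the helper names (`cosine`, `classify`) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(image_emb, prototypes):
    """Return the class whose prototype is most similar to the image embedding."""
    scores = {name: cosine(image_emb, proto) for name, proto in prototypes.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings (assumed values, not taken from any real model).
text_emb = {"cat": [1.0, 0.0, -1.0], "dog": [0.0, 1.0, -1.0]}
dog_shots = [[0.1, 0.9, 1.0], [0.0, 1.0, 1.0]]  # a few image examples for "dog" only
query_cat = [0.9, 0.1, 1.0]                     # image embedding of a cat

# Zero-shot: every class is represented by a text embedding.
zero_shot_label, zero_shot_scores = classify(query_cat, text_emb)

# Variable-shot: "dog" is represented by image examples, "cat" still by text.
v_shot_prototypes = {"cat": text_emb["cat"], "dog": mean(dog_shots)}
v_shot_label, v_shot_scores = classify(query_cat, v_shot_prototypes)

print("zero-shot prediction:", zero_shot_label)
print("v-shot prediction:", v_shot_label)
```

In this toy setup the zero-shot prediction is correct, while the v-shot prediction flips to the class that has image examples, because the image-to-image similarity exceeds every image-to-text similarity regardless of content.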

Files

Vision_Language_Pretraining_for_Variable_shot_Image_Classification__ZENODO_.pdf (734.0 kB, md5:eca6c58bdd5b7aa22564dcdcd8b0ba3b)
