Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features
University of Florence, Italy
Description
Building on recent advances in multimodal zero-shot representation
learning, in this paper we explore the use of features obtained
from the recent CLIP model to perform conditioned image retrieval.
Starting from a reference image and an additive textual description
of what the user wants with respect to the reference image, we
learn a Combiner network that is able to understand the image
content, integrate the textual description, and provide a combined
feature used to perform the conditioned image retrieval. Starting
from the bare CLIP features and a simple baseline, we show that
a carefully crafted Combiner network, based on such multimodal
features, is extremely effective and outperforms more complex
state-of-the-art approaches on the popular FashionIQ dataset.
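To illustrate the idea, the following is a minimal PyTorch sketch of a Combiner-style fusion module: it projects pre-extracted CLIP image and text features, mixes them through a small MLP, and gates the result against a convex combination of the raw inputs before L2-normalizing for cosine-similarity retrieval. The layer sizes, the gating scheme, and the `Combiner` class name are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    """Hypothetical sketch of a Combiner network that fuses CLIP image
    and text features into one feature for conditioned retrieval."""

    def __init__(self, clip_dim: int = 640, hidden_dim: int = 2560):
        super().__init__()
        self.image_proj = nn.Linear(clip_dim, hidden_dim)
        self.text_proj = nn.Linear(clip_dim, hidden_dim)
        # MLP that produces the combined feature from both modalities.
        self.combiner = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, clip_dim),
        )
        # Learned scalar gate balancing the combined feature against a
        # convex mix of the raw inputs (an assumption of this sketch).
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, 1), nn.Sigmoid())

    def forward(self, image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        img = F.relu(self.image_proj(image_feat))
        txt = F.relu(self.text_proj(text_feat))
        cat = torch.cat([img, txt], dim=-1)
        combined = self.combiner(cat)
        w = self.gate(cat)  # shape (batch, 1), broadcast over features
        fused = combined + w * image_feat + (1.0 - w) * text_feat
        # Unit-norm output so retrieval reduces to cosine similarity.
        return F.normalize(fused, dim=-1)

# Toy usage with random stand-ins for CLIP features.
image_feat = torch.randn(4, 640)
text_feat = torch.randn(4, 640)
fused = Combiner()(image_feat, text_feat)
print(fused.shape)
```

At retrieval time, the fused query feature would be compared by dot product against the (likewise normalized) CLIP features of the candidate gallery images.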
Files
3469877.3493593.pdf (566.2 kB, md5:7f582bbbdedea53ae1b0570ee32e7819)