Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

doi:10.1145/3451390

Published November 1, 2021 | Version v1

Journal article Open

Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task.
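As an illustration of what a word-region alignment can look like, the following minimal sketch scores an image-sentence pair from independently computed region and word feature vectors. The function names, the cosine similarity, and the pooling rule (best-matching region per word, summed over words) are assumptions made for this example; the abstract does not specify them, and this is not the released TERAN code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_matrix(regions, words):
    """Cosine similarity between every image region and every word.

    regions: list of region feature vectors (one per detected region)
    words:   list of word feature vectors (one per token)
    Returns a len(regions) x len(words) nested list.
    """
    r = [l2_normalize(v) for v in regions]
    w = [l2_normalize(v) for v in words]
    return [[sum(a * b for a, b in zip(ri, wj)) for wj in w] for ri in r]


def global_similarity(regions, words):
    """Pool the alignment matrix into one image-sentence score.

    Assumed pooling: for each word keep its best-matching region,
    then sum those maxima over all words.
    """
    scores = alignment_matrix(regions, words)
    return sum(
        max(scores[i][j] for i in range(len(regions)))
        for j in range(len(words))
    )


# Toy 2-D features: two regions and three words.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
score = global_similarity(regions, words)
```

In this sketch the two modalities interact only inside `alignment_matrix`, so the region and word features could be computed and stored independently of each other.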

Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links would make it impossible to separately extract the visual and textual features needed for the online search and the offline indexing steps in large-scale retrieval systems. For this reason, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way for research on effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% on the image and sentence retrieval tasks, respectively, on the Recall@1 metric. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.
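For readers unfamiliar with the Recall@K metric mentioned above, here is a minimal sketch of how it can be computed from a query-by-item similarity matrix. The function name `recall_at_k` and the toy numbers are hypothetical and chosen only for illustration; they are not taken from the paper or its repository.

```python
def recall_at_k(sim, relevant, k):
    """Fraction of queries with at least one relevant item in the top k.

    sim:      list of rows, sim[q][i] = similarity of query q to item i
    relevant: list of sets, relevant[q] = indices of the correct items for q
    k:        cutoff rank
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank item indices by decreasing similarity to the query.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if any(i in relevant[q] for i in ranked[:k]):
            hits += 1
    return hits / len(sim)


# Toy example: three queries, three candidate items, one correct item each.
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct item 0 is ranked first
    [0.3, 0.2, 0.8],  # query 1: correct item 1 is ranked last
    [0.1, 0.4, 0.5],  # query 2: correct item 2 is ranked first
]
relevant = [{0}, {1}, {2}]
r1 = recall_at_k(sim, relevant, 1)
r3 = recall_at_k(sim, relevant, 3)
```

Using a set of relevant indices per query lets the same function cover the case where a query has several correct items, for example an image paired with more than one caption.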

Files

Name	Size
2021_457546.postprint.pdf (md5:af2a63dbd7e32d7e5049fb75697d07a4)	3.0 MB

Additional details

AI4Media – A European Excellence Centre for Media, Society and Democracy 951911: European Commission
AI4EU – A European AI On Demand Platform and Ecosystem 825619: European Commission

	All versions	This version
Views	60	60
Downloads	260	253
Data volume	799.8 MB	779.0 MB
