
Published November 21, 2022 | Version v1
Other Open

Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

  • 1. CERTH-ITI

Description

In this paper, we tackle the cross-modal video retrieval problem, focusing specifically on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that generate multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations, the proposed network architecture is trained with a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network.
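The abstract does not give the formulas, so the following is only an illustrative sketch of the two ideas it names, not the authors' implementation. It assumes that each joint space contributes a cosine similarity, that the per-space similarities are summed, and that the "softmax revision" reweights each query-video score by a softmax taken over all queries for that video. The function names and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def combined_similarity(query_embs, video_embs):
    """Sum cosine similarities over all joint spaces (assumed combination rule).

    query_embs[s][i] is query i embedded in joint space s;
    video_embs[s][j] is video j embedded in the same space s.
    """
    n_q, n_v = len(query_embs[0]), len(video_embs[0])
    sim = [[0.0] * n_v for _ in range(n_q)]
    for q_space, v_space in zip(query_embs, video_embs):
        for i, q in enumerate(q_space):
            for j, v in enumerate(v_space):
                sim[i][j] += cosine(q, v)
    return sim

def revise_with_softmax(sim):
    """Reweight each score by a softmax over queries for the same video
    (an assumed form of the retrieval-stage revision)."""
    n_q, n_v = len(sim), len(sim[0])
    revised = [[0.0] * n_v for _ in range(n_q)]
    for j in range(n_v):
        weights = softmax([sim[i][j] for i in range(n_q)])
        for i in range(n_q):
            revised[i][j] = sim[i][j] * weights[i]
    return revised

# Toy data: 2 joint spaces, 2 queries, 2 videos, 2-dimensional embeddings.
query_embs = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]]]
video_embs = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]]]

sim = combined_similarity(query_embs, video_embs)
revised = revise_with_softmax(sim)
# Rank videos for query 0 by the revised score, best first.
ranking = sorted(range(len(revised[0])), key=lambda j: revised[0][j], reverse=True)
```

In this toy setup each query matches one video in both spaces, so the matching pair keeps the highest score after the revision step. The real network learns the embeddings; here they are fixed by hand purely for illustration.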

Notes

Accepted for publication; to be included in Proc. ECCV Workshops 2022. The version posted here is the "submitted manuscript" version.

Files

2211.11351.pdf

Files (2.4 MB)

Name            Size
2211.11351.pdf  2.4 MB
md5:c72c38462f0c4c3b64a261910b455f65

Additional details

Funding

CRiTERIA – Comprehensive data-driven Risk and Threat Assessment Methods for the Early and Reliable Identification, Validation and Analysis of migration-related risks 101021866
European Commission