Cross-Modal Learning for Free-Text Video Search
Description
This chapter focuses on cross-modal video retrieval, a technology with wide-ranging applications across media networks, security organizations, and even individuals managing large personal video collections. We discuss the concept of cross-modal video learning and offer an overview of deep neural network architectures in the literature, focusing on methods that combine visual and textual representations for cross-modal video retrieval. We also examine the impact of vision transformers, a learning paradigm that has significantly improved cross-modal learning performance. Finally, we present a novel cross-modal network architecture for free-text video retrieval called T×V+Objects. This method extends an existing state-of-the-art network by incorporating object-based video encoding using transformers. It leverages multiple latent spaces and combines detected objects with textual features, creating a joint embedding space for improved text-video similarity.
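To make the joint-embedding idea concrete, below is a minimal sketch of a text-video model in the spirit of the description: frame features and detected-object features are fused with a transformer video encoder, then text and video are projected into a shared space where cosine similarity ranks videos for a query. All module names, dimensions, and the single-latent-space simplification are illustrative assumptions, not the actual T×V+Objects implementation, which uses multiple latent spaces.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingModel(nn.Module):
    """Illustrative dual-encoder for text-video retrieval (assumed design)."""

    def __init__(self, text_dim=768, frame_dim=512, object_dim=256, joint_dim=512):
        super().__init__()
        # Project pooled sentence features into the joint space.
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # Map detected-object features to the frame-feature dimension so both
        # can be fed to one transformer encoder as a single token sequence.
        self.object_proj = nn.Linear(object_dim, frame_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=frame_dim, nhead=8, batch_first=True
        )
        self.video_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.video_proj = nn.Linear(frame_dim, joint_dim)

    def forward(self, text_feats, frame_feats, object_feats):
        # text_feats:   (batch, text_dim)              pooled sentence embedding
        # frame_feats:  (batch, n_frames, frame_dim)   per-frame visual features
        # object_feats: (batch, n_objects, object_dim) detected-object features
        objects = self.object_proj(object_feats)
        tokens = torch.cat([frame_feats, objects], dim=1)  # frames + objects
        video = self.video_encoder(tokens).mean(dim=1)     # pool to one vector
        # L2-normalize so the dot product equals cosine similarity.
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video), dim=-1)
        return t, v

def text_video_similarity(t, v):
    # Cosine-similarity matrix; each row ranks all videos for one text query.
    return t @ v.T
```

At retrieval time, video embeddings can be precomputed offline, so answering a free-text query reduces to encoding the sentence and taking one matrix product against the stored video vectors.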
Files
Name | Size | md5
---|---|---
Cross-modal learning for free-text video search_revised_clean.pdf | 1.0 MB | 9767b03b93bc4b3c0c8bcdae747dc33b