Multimodality in Media Retrieval

The quest for retrieving relevant media for a given query is well-studied and has various applications. Modern publicly available media collections provide diverse modalities of the same objects, which can enhance search. Our research delves into enhancing media retrieval by effectively representing and querying multimodal data. In the retrieval methods' ranking procedure, we examine efficiency through techniques like approximate nearest neighbor (ANN) indexing and high-performance computing (HPC). Our method, MuseHash, is proposed for single media object retrieval and is applied to images and 3D objects, outperforming existing methods on diverse datasets. Moreover, it significantly reduces execution times with ANN and HPC. Future plans include considering multimodality in the video retrieval domain.


INTRODUCTION
In today's digital era, the Internet grants us easy access to a plethora of media, ranging from simple images and text to more intricate structures like videos and 3D graphics. For instance, when viewing a scene from our favorite series, we perceive it as a dynamic structure comprising a sequence of frames, each accompanied by captions, timestamps, locations, and audio elements. Moreover, within the video, characters exist within a 3D space, adding layers of complexity. Despite its complexity, each video component corresponds to a specific moment.
These diverse forms of media find applications across various domains, including urban development [8], gaming, healthcare [37], historical analysis [25], archaeology [6], and computer-aided design (CAD) [12]. However, the challenge lies in efficiently representing and amalgamating this heterogeneous data. Therefore, our primary aim is to devise methods for effectively representing and integrating [1,2,10,13,40] the information within these media collections.
To achieve this aim, we categorise the different kinds of information in media into distinct abstract views, known as modalities. Our research focuses on four primary modalities: visual, which encompasses very high-resolution (VHR) images, low-resolution images from drones [18,34], and individual frames extracted from videos; text, including keyframes of images and complex captions generated by models like CLIP [32]; temporal, referring to timestamps or timeframes within videos, capturing the temporal aspect of the data; and spatial, involving geographical coordinates and intricate structures such as point clouds representing objects in 3D space [26]. While other modalities exist, such as those in medical applications, our current research primarily emphasizes these four. This classification enables us to develop a structured approach to represent and analyse diverse media collections effectively.
By comprehending and effectively incorporating these modalities, we aim to enhance the management and analysis of diverse media collections. As we navigate these challenges, our research direction encompasses the following objectives:
Objective 1: Developing multimodal retrieval approaches for static moments, spanning from simple image collections to complex 3D datasets (Section 3.1).
Objective 2: Exploring multimodal methods in large-scale realistic scenarios via indexing and query processing optimization (Section 3.2).
Objective 3: Exploring multimodal methods in video retrieval. This is outlined as future work in Section 4.
To address the first objective, we created MuseHash [31], a supervised Bayesian framework for unimodal image retrieval. We then advanced MuseHash [29] to facilitate multimodal retrieval, a development lauded for its adaptability across datasets and recognised excellence in 3D object retrieval [28].
For the second objective, we have connected ANN methods, MuseHash, and High-Performance Computing (HPC) infrastructures [30]. Our observations indicate that certain ANN methods outperform brute-force ranking approaches. GPUs show potential for longer hashes due to their parallel processing capabilities. Our research underscores the superiority of query parallelism over data parallelism in retrieval strategies.
The remainder of this paper is organized as follows: Section 2 delves into previous works in the field. Section 3 provides an overview of our methodologies and outcomes on multimodality in image retrieval (Section 3.1) as well as our query processing evaluation techniques and their outcomes (Section 3.2). Our research plans for the next year are outlined in Section 4. Finally, Section 5 concludes the paper.

BACKGROUND
In this section, we overview current state-of-the-art methods in our research field, including multimodal retrieval techniques (Section 2.1) and query processing evaluation methods (Section 2.2).

Cutting-edge Multimodal Retrieval Techniques
In our investigation of multimodality in image retrieval, we categorise it into unimodal and multimodal scenarios based on the number of modalities involved. We prioritise supervised techniques, especially supervised hashing methods, known for their superior retrieval accuracy compared to unsupervised methods, along with their memory efficiency and speed in the retrieval process.
In unimodal image retrieval, a single modality, typically one specific type of data, is utilized. Our study explores various modalities such as text, image, datetime, location, mesh, and point-cloud, investigating diverse retrieval scenarios and applications. Notably, modalities like datetime and location have not received as much research attention as image and text modalities.
Supervised methods in this context often leverage deep learning networks like Convolutional Neural Networks (CNNs) for feature learning and hash function development. Deep Cauchy Hashing (DCH) [7] optimizes hash codes using Cauchy cross-entropy loss and quantization loss within a deep learning framework. Semantic Preserving Hashing (SePH) [21] minimizes Kullback-Leibler divergence to approximate semantic affinities while unifying hash codes for various views.
We surveyed various supervised hashing methods for multimodal image retrieval, which encompass strategies such as similarity-based, adversarial-based, deep neural network, and discrete-based methods. FCMH [36] optimizes binary codes and DOCH [41] generates high-quality hash codes. LAH [39] focuses on image representations and label co-occurrence embeddings with Cauchy distribution-based hash functions. SSAH [19] incorporates a self-supervised semantic network and adversarial learning. GSPH [24] learns hash codes and functions for two modalities. MTHF [22] transfers knowledge from single-modal to cross-modal domains, while KDLFH [20] directly learns binary hash codes.
While some methods are tailored for cross-modal scenarios, exceptions like LAH [39] support multimodal queries. In the 3D retrieval domain, CMCL [15] integrates multiple 3D modalities but can be computationally intensive and dataset-sensitive.

Cutting-edge Optimization Methods
We selected several cutting-edge ANN methods based on their unique and complementary features, following the approach outlined by Aumüller et al. [3]. These include tree-based structures, graph-based structures, pruning techniques, brute-force approaches, and baseline methods.

CURRENT RESEARCH RESULTS
Figure 1 visually summarizes our research outcomes, each linked to its respective section for more details. The light blue boxes represent the unimodal MuseHash method [31], which is the foundation of our work. The light purple blocks signify the extended MuseHash method [29], which handles multimodal data. The light orange and green blocks represent the investigations and evaluations in query processing, which introduce two new components to our research.

Multimodality in Multimodal Retrieval
3.1.1 Methodology. In our research, we created a novel supervised hashing method called MuseHash [31]. MuseHash leverages Bayesian principles for hash function learning and adapts to the data's statistical properties, enhancing overall hashing and retrieval system performance. The method comprises three main phases: training, offline, and querying, each illustrated with light blue boxes in Figure 1.
During training, hash functions are generated from the training collection via Bayesian ridge regression, mapping feature vectors from the visual modality to the Hamming space. Affinity matrices are created using both ground truth labels and cosine similarity, from which semantic probabilities are derived through normalization. In the offline phase, features are extracted from the retrieval set for the visual modality. Using the learned hash functions, hash codes are computed and stored in a database, ensuring efficient storage and retrieval of multimedia data. Finally, during the querying phase, the learned hash functions are applied to a given query, and the database is queried using Hamming distances to retrieve the top-k relevant results.
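To make the offline and querying phases concrete, the following minimal Python sketch stores packed binary codes and ranks them by Hamming distance. It is an illustration under stated assumptions, not the MuseHash implementation: the real-valued projections are assumed to come from already-learned hash functions, the sign-based binarization is an assumption, and all names (`to_hash_code`, `hamming_distance`, `top_k`) and values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def to_hash_code(projections: Sequence[float]) -> int:
    """Binarize real-valued projections by sign and pack the bits into an int.

    Assumption: the projections are the outputs of learned hash functions;
    thresholding at zero is an illustrative choice.
    """
    code = 0
    for value in projections:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two packed hash codes differ."""
    return bin(a ^ b).count("1")


def top_k(query_code: int, database: Dict[str, int], k: int) -> List[Tuple[str, int]]:
    """Return the k database items closest to the query in Hamming space."""
    ranked = sorted(
        ((item_id, hamming_distance(query_code, code)) for item_id, code in database.items()),
        key=lambda pair: (pair[1], pair[0]),  # ties broken by id for determinism
    )
    return ranked[:k]


# Offline phase: hypothetical 4-bit codes stored for the retrieval set.
database = {
    "img_a": to_hash_code([0.9, -0.2, 0.4, -0.7]),  # bits 1010
    "img_b": to_hash_code([0.8, 0.3, 0.5, -0.1]),   # bits 1110
    "img_c": to_hash_code([-0.6, 0.7, -0.3, 0.2]),  # bits 0101
}

# Querying phase: hash the query the same way, then rank by Hamming distance.
query = to_hash_code([0.5, -0.4, 0.1, -0.9])        # bits 1010
results = top_k(query, database, k=2)
```

Packing codes into integers keeps the distance computation to one XOR and a bit count, which is the usual reason hashing-based retrieval is memory-efficient and fast.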
To address flexible multimodal approaches, we extended MuseHash [31] to support any number of modalities [29]. This enhancement enables efficient fusion of different modalities for the same object, computing hash codes and retrieving relevant items.

Evaluation.
Researchers in the retrieval domain are more familiar with very high-resolution data than with data from underwater and aerial footage. Handling all these different collections in a way that keeps retrieval efficient is a challenge.
Hence, MuseHash [29] is designed specifically for multimodal queries across diverse collections. In our study, we compare MuseHash with LAH across five datasets (Table 2), varying hash code lengths and utilizing all available modalities. In this table, both the query and each element within the collection leverage all available modalities, integrating them into a unified hash code. While LAH can only fuse two modalities, MuseHash accommodates more than two modalities. AU-AIR [5] and MarDCT [4] are UAV datasets that contain images with temporal information and images with geotemporal information, respectively. SeaDronesSee [35] is an underwater dataset that includes images with geotemporal information. MIRFlickr25K [14] and NUS-WIDE [9] are benchmark high-resolution datasets commonly utilized in the literature.
MuseHash consistently outperforms LAH in all cases with statistical significance (the symbol "*" denotes statistical significance according to a t-test). For enhanced evaluation robustness, we utilized a 5-fold cross-validation methodology across all experiments. In particular, MuseHash performs exceptionally well when using all modalities in the MarDCT, MIRFlickr25K, and NUS-WIDE datasets, benefiting from the high-quality information present in these collections.
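As an illustration of how per-fold scores from 5-fold cross-validation can be compared with a t-test, the sketch below computes a paired t statistic. The paper does not state which t-test variant was used, so the paired form is an assumption here, and the per-fold values are hypothetical numbers, not results from Table 2.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t statistic of the per-fold differences a - b (paired test)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long sequences with at least two folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / std_err


# Hypothetical per-fold mAP values for two methods (illustrative only).
method_a = [0.80, 0.82, 0.79, 0.81, 0.83]
method_b = [0.75, 0.76, 0.74, 0.77, 0.76]

t_value = paired_t_statistic(method_a, method_b)

# Two-sided critical value of Student's t with 4 degrees of freedom, alpha = 0.05.
T_CRITICAL_DF4 = 2.776
significant = abs(t_value) > T_CRITICAL_DF4
```

With five folds there are four degrees of freedom, so a difference is marked significant at the 5% level only when the absolute t statistic exceeds 2.776.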
In summary, MuseHash emerges as a superior performer, consistently outpacing seven state-of-the-art methods across diverse image collections in both multimodal and unimodal scenarios. It achieves this by leveraging a combination of various visual descriptors such as VGG16 and ResNet50, and textual descriptors like Bag-of-Words (BoW) or BERT. Moreover, MuseHash offers flexibility with hash code lengths ranging from 16-bit to 128-bit, catering to different requirements and scenarios. This comprehensive approach enables MuseHash to exhibit greater robustness compared to existing methods, making it a compelling choice for multimodal query tasks across diverse datasets.
Expanding into 3D collections, we have extended the application of MuseHash from image retrieval to 3D object retrieval [28]. Specifically, we have adapted the multimodal MuseHash technique for volumetric data queries. In this context, the multimodal approach integrates various types of data representations associated with 3D objects, such as meshes and point clouds. By doing so, MuseHash extends its utility beyond traditional 2D image datasets, allowing for more comprehensive and effective retrieval of 3D objects based on multimodal queries.
During evaluation, MuseHash consistently outperformed other methods in both unimodal and multimodal scenarios across different hash lengths and epochs for the ModelNet40 dataset (Table 1). ModelNet40 [38] and BuildingNet_v0 [33] are two publicly available benchmark datasets, containing image, mesh, and point cloud representations of 3D objects. The former is a dataset of 3D CAD models dedicated to object categories (e.g., car, airplane), while the latter includes different building types (e.g., church, palace) associated with their textures (e.g., color, material).
While the CMCL approach excelled in accuracy with more epochs, its mAP performance fell short (as before, the symbol "*" denotes statistical significance according to a t-test). As in the image experiments, we used 5-fold cross-validation in the evaluation. Overall, MuseHash demonstrated competitive performance. The multimodal variant showed significant performance improvements with longer code lengths (16 to 32), especially for larger lengths (64 and 128). However, extending the code length beyond this range did not yield substantial gains.
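For readers less familiar with the mAP metric reported here, the following sketch shows one common way to compute it from ranked relevance judgements. It is a generic illustration, not the evaluation code behind Table 1; in particular, normalising by the number of relevant items found in the ranked list is an assumption, and the example rankings are hypothetical.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """AP of one ranked list; relevance[i] says whether the item at rank i+1 is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Hypothetical relevance judgements for two queries (illustrative only).
queries = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = (1/2) / 1
]
map_score = mean_average_precision(queries)
```

Because mAP rewards relevant items that appear early in the ranking, a method can reach high classification accuracy and still score a lower mAP, which matches the contrast noted for CMCL above.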
In conclusion, our study applied advanced image retrieval methods to 3D object retrieval, adapting MuseHash for volumetric data queries. MuseHash's exploitation of inter-modality relationships consistently outperformed three state-of-the-art methods across two benchmark image collections.

Evaluation in Query Processing
3.2.1 Methodology. This section discusses the relationship between MuseHash, Approximate Nearest Neighbor (ANN) methods, and High-Performance Computing (HPC) infrastructure [30]. Figure 1 illustrates new components integrated into the MuseHash architecture, depicted in light green and light orange. The light green block represents the integration of ANN methods into the MuseHash ranking process, while the light orange block signifies optimization techniques applied to feature and ranking processes.
We prioritise hardware resource optimization through parallel computing using multithreading and GPUs, particularly leveraging NVIDIA CUDA for efficient feature extraction in both offline and querying phases. Additionally, we explore multi-GPU setups to significantly enhance performance. To evaluate our methods' performance, we utilize two approaches: data parallelism and query parallelism.
Data parallelism: Splitting the data into segments enables distribution among multiple processes for searching, allowing us to assess system scalability when numerous processes collaborate to process the data.
Query parallelism: Maintaining a pool of processes ready to handle incoming queries allows for efficient allocation of queries, aiding in the assessment of how effectively the system manages concurrent queries.
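The difference between the two strategies can be sketched in a few lines of Python. This is a simplified illustration, not the setup used in our experiments: a thread pool stands in for the pool of processes, the search is a brute-force Hamming ranking, and all names and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Item = Tuple[int, int]  # (item id, packed hash code)
Hit = Tuple[int, int]   # (Hamming distance, item id)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed hash codes."""
    return bin(a ^ b).count("1")


def search(query: int, items: Sequence[Item], k: int) -> List[Hit]:
    """Brute-force top-k over the given items, sorted by (distance, id)."""
    return sorted((hamming(query, code), item_id) for item_id, code in items)[:k]


def data_parallel_search(query: int, items: Sequence[Item], k: int, workers: int) -> List[Hit]:
    """One query; the data is split into shards that are searched concurrently."""
    shards = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda shard: search(query, shard, k), shards)
        merged = [hit for hits in partial for hit in hits]
    # The global top-k is contained in the union of the per-shard top-k lists.
    return sorted(merged)[:k]


def query_parallel_search(queries: Sequence[int], items: Sequence[Item], k: int, workers: int) -> List[List[Hit]]:
    """Many queries; each one is handled whole by a worker from a shared pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: search(q, items, k), queries))


# Hypothetical 8-bit codes and queries (illustrative only).
items = [(i, (i * 37) % 256) for i in range(100)]
queries = [0b10101010, 0b00001111, 0b11110000]

sequential = [search(q, items, 3) for q in queries]
by_data = [data_parallel_search(q, items, 3, workers=4) for q in queries]
by_query = query_parallel_search(queries, items, 3, workers=4)
```

Both strategies return the same rankings as the sequential search; they differ in where the coordination cost falls. Data parallelism needs a merge step per query, whereas query parallelism needs none, which is one plausible reason it scales better under concurrent load.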

Evaluation.
When handling multimodal media, speed and efficiency are crucial. Our objective [30] is to identify effective techniques for data acquisition and analysis across both CPU and GPU platforms. The CPU experiments use the AU-AIR and LSC'23 [11] datasets, while the GPU experiments focus solely on AU-AIR.
The LSC'23 dataset was generated by an active lifelogger over the course of 18 months and captured by a wearable camera. Every image within this dataset is associated with pertinent captions, temporal data, spatial information, or a combination of these elements. Below, we summarize our main findings.
Superior throughput and quality: The graph-based Hnswlib method excelled in both throughput and result quality.
Multi-core synergy: Combining specific data organization methods with the concurrent use of multiple CPU cores accelerates processing.
Comparison of GPU and CPU: CPUs outperformed GPUs in some tasks, which suggests that careful strategies are needed to realise the potential of specialised hardware. Complex data also slows processing, so handling it well is crucial for efficiency.

FUTURE WORK
Our plans for the remainder of the thesis involve developing an efficient video data querying framework. Initially, we aim to devise a method tailored to benchmark datasets and compare it against state-of-the-art (SOTA) methods. Furthermore, we plan to evaluate video retrieval methods using more realistic workloads and datasets like the Known-Item Search (KIS) tasks from the Video Browser Showdown (VBS) [23] and Lifelog Search Challenge (LSC) [11] competitions. Additionally, inspired by recent research, we seek to explore various modalities for the VBS task, investigating which combinations (such as action-based, visual-based, or CLIP-based) yield the most benefit for our system [16,17]. If the initial approaches do not yield the expected results, we will consider examining SOTA reinforcement learning methods for the KIS task as an alternative direction. Subsequently, we intend to integrate our proposed techniques into our search engine system, VERGE [27], which participates in the VBS competition. This will allow us to observe first-hand how our framework performs in practical search scenarios.

Table 1: mAP and accuracy results for ModelNet40 and BuildingNet_v0 with different code lengths or number of epochs and query modalities.

Table 2: Multimodal query results for different datasets and different code lengths and all modalities.