Attention Mechanisms, Signal Encodings and Fusion Strategies for Improved Ad-hoc Video Search with Dual Encoding Networks

In this paper, the problem of unlabeled video retrieval using textual queries is addressed. We present an extended dual encoding network which makes use of more than one encoding of the visual and textual content, as well as two different attention mechanisms. The latter serve the purpose of highlighting the temporal locations in each modality that can contribute most to effective retrieval. The different encodings of the visual and textual inputs, along with early/late fusion strategies, are examined for further improving performance. Experimental evaluations and comparisons with state-of-the-art methods document the merit of the proposed network.


INTRODUCTION
In the last years, the explosion of social media use has led to a rapid increase in the multimedia content that is available on the Internet. This content originates from a variety of sources, and its nature is extremely heterogeneous, i.e. it includes video, images, audio, text etc., and combinations of them. Despite this multimodality, text-based queries remain the most natural way for people to search for content, be it video, images etc. The research field of text-based video retrieval, or more generally cross-modal retrieval, addresses the problem of retrieving items of one modality (in our case, video) when the given query is of another modality (text).
A typical application scenario of text-based video search is Ad-hoc Video Search (AVS), originally introduced as a TRECVID benchmark task [23] [1]. Given a set of unlabeled video shots and an unseen textual query, the goal of an AVS method is to retrieve the most related video shots, ranked from the most relevant to the least relevant shot for the query. The main challenge of AVS, and its key difference from other video retrieval problems (e.g. concept-based retrieval [19] [11]), is the lack of video examples for the queries. Moreover, these queries, which are given in natural language form, contain complex subject relations, e.g., "Find shots of exactly two men at a conference or meeting table talking in a room".
Many methods have been proposed for the AVS problem in recent years, e.g. [19] [15] [25]. Most of them rely on examining the correlation of visual concepts with the textual queries, i.e. they use a variety of pre-trained visual concept detectors. The number and diversity of these detectors are crucial for the retrieval performance.
During the last few years, several deep learning methods have been proposed for visual or text analysis and classification. Progress in the natural language processing field led to compelling text embedding methods [20] [5] and gave the necessary boost to a variety of cross-modal problems such as text-based video retrieval and image/video captioning. For this reason, recent AVS methods use deep learning for embedding the representations of different modalities (textual queries, videos) into a common subspace in a way that the new representations can be compared directly.
In this paper, we focus on the AVS problem. We use as a starting point the deep network architecture introduced in [8], in which two similar networks are jointly trained using several state-of-the-art (SoA) methods such as recurrent neural networks (GRUs) [4], text embeddings [20], and deep image classification networks [13] [26]. We extend this architecture by introducing two attention mechanisms. We also introduce and examine the impact of using more than one encoding of the visual content as well as of the textual query. As is typically the case in the relevant literature, pairs of video shots and captions are used for training the network. The main contributions of our work are summarized as follows:
• We integrate and evaluate two attention mechanisms into the dual encoding network. These lead to better textual and visual representations in the common subspace.

RELATED WORK
Early solutions to the AVS problem were based on large pools of visual concept detectors, and on NLP techniques for query decomposition in order to identify concepts in the textual queries. In [19], a set of NLP rules and a variety of pre-trained deep neural networks for video annotation were used in order to associate visual concepts with the provided textual queries. In [14], a large number of concept, scene and object detectors were used along with an inverted index structure for query-video association. Recent SoA approaches rely on deep neural networks for directly comparing textual queries and the visual content in a common space [12]. Also, inspired by problems similar to AVS, e.g. cross-modal retrieval or visual question-answering, solutions that have been proposed for these problems were modified and adapted to AVS. In [9], an improved multi-modal embeddings system was proposed, together with a loss function that utilizes the hard negative samples of the dataset; this approach was adapted to the AVS problem in [3]. In [15], an improved version of the image-to-text matching method of [6] was proposed for the AVS task. More specifically, [15] used the method of [6] together with the triplet loss function of [9] and an improved sentence encoding strategy. In [22], a weakly-supervised method was proposed to learn a joint visual-text embedding space using an attention mechanism to highlight temporal locations in a video that are relevant to a textual description. This mechanism was also used for extracting text-dependent visual features. Recently, the dual encoding network proposed in [8] encodes videos and queries into a dense representation using multi-level encodings for both text and videos and the improved loss function of [9]. In [10], the problem of video retrieval was addressed by training three different networks using different training datasets, and combining them by using an additional neural network.

PROPOSED METHOD
In this work, we propose an improved dual encoding method designed for Ad-hoc Video Search. Inspired by the dual encoding network presented in [8] (Section 3.1), we create a network that encodes video-caption pairs into a common feature subspace. In contrast to [8], our network utilizes attention mechanisms for more efficient textual and visual representation, and exploits the benefits of richer textual and visual embeddings.
Let V be a media item (e.g., an entire video or a video shot) and S the corresponding caption of V. Our network translates both V and S into a new common feature space Φ(·), resulting in two new representations Φ(V) and Φ(S) that are directly comparable. For this, two similar modules, consisting of multiple levels of encoding, are utilized, for the visual and textual content respectively. Moreover, two new attention components are integrated into the baseline network. The overall network architecture is illustrated in Fig. 1.
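To make the retrieval step concrete, the following is a minimal sketch of how video shots can be ranked once Φ(V) and Φ(S) are available. It is an illustration assuming cosine similarity in the common space, not the authors' released code; the tensors are assumed to be the outputs of the trained visual and textual encoding modules.

```python
import torch
import torch.nn.functional as F

def rank_shots(query_vec, shot_vecs):
    """Rank video shots by cosine similarity to a query embedding.

    query_vec: tensor of shape (d,)   -- Phi(S) for the textual query
    shot_vecs: tensor of shape (N, d) -- Phi(V) for each of N candidate shots
    Returns shot indices sorted from most to least relevant.
    """
    q = F.normalize(query_vec.unsqueeze(0), dim=-1)   # (1, d)
    v = F.normalize(shot_vecs, dim=-1)                # (N, d)
    sims = (v @ q.t()).squeeze(1)                     # cosine similarities, shape (N,)
    return torch.argsort(sims, descending=True)
```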

Dual encoding network
For every video, three different encodings are created: $\phi(V)_1$, $\phi(V)_2$, $\phi(V)_3$. We consider a video or a video shot as a sequence of keyframes $V = \{v_1, v_2, \ldots, v_n\}$, where each keyframe vector $v_t$ is the output of a pre-selected hidden layer of a pretrained deep network, e.g. the pool5 layer of ResNet [13] or ResNext [26]. The first encoding is the global representation of every video and is obtained by mean pooling the individual keyframe representations: $\phi(V)_1 = \frac{1}{n}\sum_{t=1}^{n} v_t$. Next, the keyframe representation vectors $\{v_1, v_2, \ldots, v_n\}$ are fed into a sequence of bi-directional Gated Recurrent Units [4] (bi-GRUs).
The hidden state at time $t$ of the forward GRU is denoted as $\overrightarrow{h_t}$ and that of the backward GRU as $\overleftarrow{h_t}$; their concatenation $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$ forms the rows of the feature matrix $H_v = [h_1, h_2, \ldots, h_n]$. To obtain the second level of encoding, mean pooling of the $h_t$ values is performed: $\phi(V)_2 = \frac{1}{n}\sum_{t=1}^{n} h_t$. Subsequently, a 1-d CNN is built and fed with the feature matrix $H_v$. A convolutional layer $Conv1d_{k,r}$ is used, with $r$ filters of size $k$. After applying ReLU activation and max pooling to the layer's output, the vector $c_k = \mathrm{maxpool}(\mathrm{ReLU}(Conv1d_{k,r}(H_v)))$ is produced. Multiple representations of the video are created, using different values $k = 2, 3, 4, 5$. The third-level video representation is the concatenation of the produced vectors: $\phi(V)_3 = [c_2, c_3, c_4, c_5]$. Finally, the concatenation of the previously generated features is used to form the global, multi-level feature representation of the video: $\Phi(V) = BN(W_v\,[\phi(V)_1, \phi(V)_2, \phi(V)_3] + b_v)$, where $W_v$ and $b_v$ are trainable parameters and $BN(\cdot)$ is a batch normalization layer.
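The following is a minimal PyTorch sketch of the three video encoding levels described above. The layer sizes, number of filters and output dimension are illustrative assumptions rather than the exact values used in our experiments, and the sketch assumes every shot contributes at least five keyframes.

```python
import torch
import torch.nn as nn

class VideoMultiLevelEncoder(nn.Module):
    """Sketch of the three video encoding levels: mean pooling of frame
    features, mean pooling of bi-GRU states, and a multi-kernel 1-d CNN over
    the bi-GRU outputs; their concatenation is projected to the common space.
    Dimensions below are illustrative, not the paper's exact settings."""
    def __init__(self, frame_dim=2048, gru_dim=512, n_filters=512, out_dim=2048):
        super().__init__()
        self.bigru = nn.GRU(frame_dim, gru_dim, batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * gru_dim, n_filters, kernel_size=k) for k in (2, 3, 4, 5)])
        concat_dim = frame_dim + 2 * gru_dim + 4 * n_filters
        self.proj = nn.Linear(concat_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, frames):                    # frames: (B, n, frame_dim)
        phi1 = frames.mean(dim=1)                 # level 1: mean-pooled frame features
        H, _ = self.bigru(frames)                 # (B, n, 2*gru_dim)
        phi2 = H.mean(dim=1)                      # level 2: mean-pooled bi-GRU states
        Ht = H.transpose(1, 2)                    # (B, 2*gru_dim, n) for Conv1d
        phi3 = torch.cat(                         # level 3: multi-kernel 1-d CNN
            [torch.relu(conv(Ht)).max(dim=2).values for conv in self.convs], dim=1)
        return self.bn(self.proj(torch.cat([phi1, phi2, phi3], dim=1)))
```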
Similarly to the visual content encoding network, a multi-level encoding $\phi(S)_1$, $\phi(S)_2$, $\phi(S)_3$ is generated for the textual content. Given a sentence containing $m$ words, the $\phi(S)_1$ representation is created by averaging the individual one-hot vectors $\{w_1, w_2, \ldots, w_m\}$. Next, as the second level of textual encoding, a deep network-based representation of every word is used as input to the bi-directional GRU module and, similarly to $\phi(V)_2$, $\phi(S)_2 = \frac{1}{m}\sum_{t=1}^{m} h_t$. Then, the feature matrix $H_s$ of the textual bi-GRUs is forwarded into a 1-d convolutional layer with filter sizes $k = 2, 3, 4$, and $\phi(S)_3$ is calculated similarly to $\phi(V)_3$ above. The final textual representation is: $\Phi(S) = BN(W_s\,[\phi(S)_1, \phi(S)_2, \phi(S)_3] + b_s)$, where $W_s$ and $b_s$ are trainable parameters and $BN(\cdot)$ is a batch normalization layer. Following [9], [21] and [8], the improved marginal ranking loss is used to train the entire network.
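For reference, a hedged sketch of an improved marginal ranking loss in the spirit of [9] (using the hardest in-batch negatives) is shown below; the in-batch negative sampling and the assumption of L2-normalized embeddings are illustrative choices, not a reproduction of the exact training code.

```python
import torch

def improved_marginal_ranking_loss(video_emb, text_emb, margin=0.2):
    """Hard-negative triplet ranking loss in the spirit of [9]: for every
    matching video-caption pair, only the hardest in-batch negative
    contributes. Embeddings are assumed L2-normalized, so the dot product
    equals cosine similarity; video_emb and text_emb have shape (B, d)."""
    sims = video_emb @ text_emb.t()              # (B, B) similarity matrix
    pos = sims.diag().view(-1, 1)                # similarities of matching pairs
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)

    # caption retrieval: hardest negative caption for each video
    cost_s = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0)
    # video retrieval: hardest negative video for each caption
    cost_v = (margin + sims - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_s.max(dim=1).values.sum() + cost_v.max(dim=0).values.sum()
```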

Introducing Self-Attention Mechanisms
The 1-d CNN layer that is fed with $H_s$ or $H_v$ in the original network of [8] treats each item of the word or frame sequence equally. Our goal is to exploit the most meaningful information from the textual and visual sequences, particularly the words with the highest semantic importance and the keyframes that are most representative of a video shot. For this, we introduce a self-attention mechanism [2] [18] in each modality, in order to find the relative importance of each word in the input sentence, and to find the important temporal locations in a video shot. An overview of this self-attention mechanism is illustrated in Fig. 2.
In the textual encoding part of the network, given the output $H_s$ of the bi-GRU, the attention model outputs a weight vector $a_s = \mathrm{softmax}(w_2 \tanh(W_1 H_s^T))$, where $W_1$ is a trainable weight matrix of size $d_a \times 2u$, $d_a$ is a hyper-parameter, $u$ is the size of a single bi-GRU unit, and $w_2$ is a parameter vector of size $d_a$. The $w_2$ vector is extended into an $r \times d_a$ matrix $W_2$ for multi-head attention, so as to model $r$ semantic aspects of the text, as in [18], resulting in a weight matrix $A_s = \mathrm{softmax}(W_2 \tanh(W_1 H_s^T))$. The $\mathrm{softmax}(\cdot)$ is used for weight normalization, so that all the weights sum up to 1. Then, the attention matrix $A_s$ is multiplied with the initial $H_s$, resulting in the matrix $\tilde{H}_s = A_s H_s$. $\tilde{H}_s$ is forwarded into the 1-d convolutional layer instead of the feature matrix $H_s$, as described in Sec. 3.1. This text-based self-attention mechanism is denoted as Att_s in the sequel.
A similar self-attention mechanism, denoted as Att_v, is integrated in the visual encoding module. In this case $H_v$ is used for calculating the attention-weighted matrix $A_v$, resulting in $\tilde{H}_v = A_v H_v$, which is forwarded into the visual 1-d convolutional layer in place of $H_v$.
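A minimal PyTorch sketch of this self-attention component, following the formulation of [18] and applied identically as Att_s and Att_v, is given below; the values of the hyper-parameters $d_a$ and $r$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StructuredSelfAttention(nn.Module):
    """Self-attention of [18]: A = softmax(W2 * tanh(W1 * H^T)), used for both
    the textual (Att_s) and visual (Att_v) bi-GRU outputs. The attended matrix
    A @ H replaces H as input to the subsequent 1-d CNN."""
    def __init__(self, gru_dim=512, d_a=256, r=8):   # illustrative sizes
        super().__init__()
        self.W1 = nn.Linear(2 * gru_dim, d_a, bias=False)
        self.W2 = nn.Linear(d_a, r, bias=False)      # r attention heads

    def forward(self, H):                            # H: (B, n, 2*gru_dim)
        scores = self.W2(torch.tanh(self.W1(H)))     # (B, n, r)
        A = torch.softmax(scores, dim=1)             # each head's weights over the n positions sum to 1
        return A.transpose(1, 2) @ H                 # attended matrix, (B, r, 2*gru_dim)
```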

Examining multiple encodings and fusion strategies
Dealing with such a demanding task, where the SoA methods typically achieve an accuracy of about 10-22% on the different evaluation datasets, it is vital to exploit the advantages of different signal encodings. Regarding the video module, two SoA deep neural network architectures are used for frame feature extraction: the ResNext-101 [26] and ResNet-152 [13] models. Concerning the text module, the performance of the Word2Vec [20] model, as well as of the bidirectional transformer-based language model BERT [5], is examined. We also examine early fusion (as shown in Fig. 1), i.e. concatenation of encoding vectors, versus late fusion (i.e. merging of ranked lists, each obtained using a different text-visual encoding pair), for jointly exploiting the multiple encodings.
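The two fusion strategies can be sketched as follows. This is a hedged illustration: the Borda-style rank aggregation shown for late fusion is an assumption made for clarity, not necessarily the exact combination scheme used in our experiments.

```python
import numpy as np

def early_fusion(encodings):
    """Early fusion: concatenate the different encoding vectors of the same
    item (e.g. W2V- and BERT-based sentence encodings) into a single vector
    before it enters the common-space projection."""
    return np.concatenate(encodings, axis=-1)

def late_fusion(ranked_lists, n_items):
    """Late fusion: merge ranked lists, each produced by a different
    text-visual encoding pair. A simple Borda-style score is assumed here:
    an item at position p in a list of length L receives L - p points."""
    scores = np.zeros(n_items)
    for ranking in ranked_lists:                 # ranking: item ids, best first
        for pos, item in enumerate(ranking):
            scores[item] += len(ranking) - pos
    return np.argsort(-scores)                   # fused ranking, best first
```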

EXPERIMENTS

Experimental setup
We train our network (software available at: github.com/bmezaris/AVS_dual_encoding_attention_network) using the combination of two large-scale video datasets: MSR-VTT [27] and TGIF [17]. We evaluate its performance on the official evaluation dataset of the TRECVID AVS task for the years 2016, 2017, and 2018, i.e. the IACC.3 test collection consisting of 4,593 videos and altogether 335,944 shots. As evaluation measure we use the mean extended inferred average precision (MXinfAP), which is an approximation of the mean average precision suitable for the partial ground truth that accompanies the TRECVID dataset. As initial frame representations, generated by a ResNext-101 (trained on the ImageNet-13k dataset) and a ResNet-152 (trained on the ImageNet-11k dataset), we use the publicly-available features released by [15]. Also, two different word embeddings are utilized: i) the Word2Vec model [20] trained on the English tags of 30K Flickr images, provided by [7]; and ii) the pre-trained language representation model BERT [5], trained on Wikipedia content.

Results and discussion
For comparison purposes, we used the publicly available code of [8] to re-train the network with the same configuration and features that we use in our methods. This method is indicated as W2V+ResNext-101 in Table 1 and is used as the baseline for our experiments. Overall, three general network architectures are trained: i) the baseline network, ii) the network with the text-based attention mechanism, and iii) the network with the visual-based attention mechanism. Each network is trained using one or both of the available word embeddings (i.e., Word2Vec, denoted W2V, and BERT) and one or both of the visual features (ResNext-101, ResNet-152). The resulting setups of each architecture are combined by late fusion using [24] (Combination of 6 setups columns in Table 1), while the Best of 6 setups columns present the results of the best-performing among these setups.

Table 1: Results (MXinfAP) of the proposed networks and their combinations, compared with the baseline [8]. The best results for each dataset are indicated in bold, while those that are worse than the baseline are given in parentheses. All reported training/inference times are in hours, for a single setup (they should be multiplied by 6 for the Combination of 6 setups) and for processing the whole training/test dataset. These numbers are not to be confused with the query execution time, which is approximately 30 sec. for all but the late fusion methods, and 4 times higher for the latter.

The results reported in Table 1(a)-(f) show that both attention mechanisms improve the performance of the baseline method. Furthermore, using better word embeddings (BERT) consistently improves the performance in comparison to W2V.
In the (g), (h) and (i) configurations of Table 1, the results of the early fusion of the text and visual embeddings are presented. The results indicate that the combination of different visual (ResNext-101, ResNet-152) and textual (W2V, BERT) features leads to improved performance. Moreover, the integration of the aforementioned attention mechanisms further improves performance.
Subsequently, in configurations (j), (k) and (l) of Table 1, the performance of late-fusion combinations of the previously-examined networks is presented. Configuration (j) is the late fusion of the baseline network trained with different textual and visual features. When considering the combination of 6 setups, this approach usually outperforms the corresponding early fusion model (g) by a small margin. The late fusion of models with text- or visual-based attention performs similarly to, or slightly better than, the corresponding early fusion approaches when combining the best of 6 setups (columns II), however at the expense of considerably higher training and inference times.
In Table 2, the recommended single-setup early fusion configurations of the proposed method (shaded rows (h) and (i) of Table 1) are compared with SoA works from the literature (including the top performer of the TRECVID 2018 competition [15]), based on the MXinfAP scores achieved on the same evaluation datasets.

CONCLUSIONS
This paper examined the problem of video retrieval using textual queries. We focused on a network that encodes the visual and textual modalities into a common space. We extended this network by integrating a self-attention mechanism in each modality. The experimental results confirm the contribution of this extension to the performance of the network. Moreover, the effectiveness of using multiple textual and visual representations was experimentally evaluated, and the early fusion of the different text and visual encodings, together with an attention mechanism, was shown to achieve state-of-the-art results without considerable impact on the time-efficiency of the network's training and inference.

ACKNOWLEDGMENTS
This work was supported by the EU Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV.