VERGE IN VBS 2020



Introduction
VERGE is an interactive video retrieval system that provides users with efficient browsing and various search capabilities over a set of video collections. For more than ten years, VERGE has participated in numerous video retrieval-related conferences and showcases, including TRECVID [1] and the Video Browser Showdown (VBS) [2]; thus, the system has been adapted to support the Known Item Search (KIS), Instance Search (INS) and Ad-Hoc Video Search (AVS) tasks. Experience from previous participations drove this year's selection of mature solutions (Section 2.1), the improvement of existing modalities (Sections 2.2, 2.6), the integration of new ones (Sections 2.3, 2.4, 2.5), as well as advances regarding the user experience.

Video Retrieval System
VERGE serves as a video search engine with user-friendly browsing and a variety of modules for retrieving an image or a video from a collection. Furthermore, the different search functionalities can be fused into a combined query, or they can be used consecutively to re-rank the top results. A detailed description of the implemented indexing and retrieval modules follows in the next subsections, while the general architecture of VERGE can be seen in Figure 1. It should be noted that all shot-based algorithms operate on the keyframes derived from the provided V3C1 segmentation.

Visual Similarity Search
This module performs visual similarity retrieval of relevant content using convolutional neural networks (CNNs) within a deep hashing architecture. A deep hashing approach [3] represents the visual information with a small number of bits (12, 24, 32 or 48). The retrieval framework then finds the relevant visual content by comparing the Hamming distance between the binary vectors of the gallery images and that of the query. The backbone convolutional network is an architecture similar to AlexNet or VGG16. Finally, an IVFADC vector index is created for fast binary indexing, and the K-Nearest Neighbors of the query image are computed [4].
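As a sketch of the matching step, assuming the binary hash codes of the gallery have already been pre-computed (the function name and code length below are illustrative, not part of the system):

```python
import numpy as np

def hamming_rank(query_code, gallery_codes, k=5):
    """Rank gallery items by Hamming distance to the query.

    query_code:    (n_bits,) binary vector, e.g. a 12-bit deep hash of the query image.
    gallery_codes: (n_items, n_bits) binary matrix of pre-computed hashes.
    Returns the indices of the k nearest items (smallest Hamming distance).
    """
    # Counting the positions where the bits differ gives the Hamming distance
    dists = np.count_nonzero(gallery_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")[:k]

# Toy example with 12-bit codes (the smallest code length mentioned above)
gallery = np.array([[0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
                    [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
                    [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]])
query = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
print(hamming_rank(query, gallery, k=2))  # → [0 2]
```

In the actual system this brute-force comparison is replaced by the IVFADC index, which avoids scanning the whole gallery.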

Concept-Based Retrieval
This module annotates each keyframe with a pool of concepts, which comprises 1000 ImageNet concepts, 345 concepts of the TRECVID SIN task [5], 500 event-related concepts, 80 action-related concepts, 365 scene classification concepts, 580 object labels and 30 style-related concepts. For performing the annotation, each keyframe was split into 9 equally-sized regions by applying a 3 × 3 grid, and each region, as well as the whole image, was processed separately so as to incorporate coarse localization information into the annotations. To obtain the annotation scores for the 1000 ImageNet concepts, we used an ensemble method, averaging the concept scores from four pre-trained models that employ different DCNN architectures, namely VGG16, InceptionV3, InceptionResNetV2, as well as a hybrid model that combines the ImageNet and Places365 concept pools [6]. To obtain scores for the 345 TRECVID SIN concepts, we used the deep learning framework of [7]. For the event-related concepts we used the pre-trained model of EventNet [8], while for the action-related concepts we used a model trained on the AVA dataset [9]. Regarding the extraction of the scene-related concepts, we utilized the publicly available VGG16 model fine-tuned on the Places365 dataset.
Object detection scores were extracted using models pre-trained on the established MS COCO and Open Images V4 datasets, with 80 and 500 detectable objects, respectively, and the bounding box information for each detected object was used for assigning the detection to one of the 9 considered keyframe regions. Finally, for the style-related concepts we employed the pre-trained models of [10].
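The assignment of a detection to one of the 9 keyframe regions can be sketched as follows; mapping by the centre of the bounding box is an assumption here, since the exact assignment rule is not stated:

```python
def grid_cell(bbox, img_w, img_h):
    """Map a detection's bounding box to one cell of a 3x3 grid over the keyframe.

    bbox = (x_min, y_min, x_max, y_max) in pixels.
    Returns a cell index 0..8, row-major (0 = top-left, 8 = bottom-right).
    The box centre decides the cell (an illustrative choice, not the paper's rule).
    """
    cx = (bbox[0] + bbox[2]) / 2.0
    cy = (bbox[1] + bbox[3]) / 2.0
    # Clamp to 2 so a centre lying exactly on the right/bottom edge stays in-grid
    col = min(int(3 * cx / img_w), 2)
    row = min(int(3 * cy / img_h), 2)
    return 3 * row + col

# A box centred in the middle of a 300x300 keyframe falls in the centre cell
print(grid_cell((120, 130, 180, 170), 300, 300))  # → 4
```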

Text-to-Video Matching Module
This module compares a complex free-text query with a set of keyframes and returns a ranked list of the most correlated keyframes. Following the method proposed in [11], we use an architecture that learns to represent a textual instance (e.g. a sentence) and a visual instance (i.e. a keyframe) in a common feature space, so that the correlation between a given text S_i and an image Im_j is directly comparable in that space. For this, a dual encoding deep neural network that projects a natural language sentence and a shot keyframe into the common feature space is used. The network performs multi-level encoding in parallel for both sentences and keyframes. A pre-trained ResNet-152 model is used for the initial keyframe representation, whereas each sentence is initially encoded as a bag-of-words vector. Then, both the sentence and the keyframe representations go through three different encoders (i.e. mean-pooling, a bi-GRU-based sequential model [12], and a bi-GRU-CNN [13]). To train this module, we followed the approach of [14], and in terms of training data we combined two datasets, TGIF [15] and MSR-VTT [16]. The TGIF dataset contains approx. 100k short animated GIFs with one short description each, while MSR-VTT consists of 10k short video clips, each accompanied by 20 short descriptions.
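Once both modalities live in the common feature space, retrieval reduces to a nearest-neighbour search over the keyframe embeddings; a minimal sketch using cosine similarity (the embedding values below are toy placeholders, not outputs of the actual network):

```python
import numpy as np

def rank_keyframes(text_emb, keyframe_embs, k=3):
    """Rank keyframes by cosine similarity to a sentence embedding.

    Both inputs are assumed to already live in the learned common space
    produced by the dual encoding network.
    """
    # L2-normalise so the dot product equals the cosine similarity
    t = text_emb / np.linalg.norm(text_emb)
    K = keyframe_embs / np.linalg.norm(keyframe_embs, axis=1, keepdims=True)
    sims = K @ t
    return np.argsort(-sims)[:k]

text = np.array([1.0, 0.0])            # toy sentence embedding
frames = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.7, 0.7]])        # toy keyframe embeddings
print(rank_keyframes(text, frames, k=2))  # → [0 2]
```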

Automatic Speech Recognition
Acoustic content from the videos is also exploited: the audio channels are extracted and Automatic Speech Recognition (ASR) is applied to them in order to produce speech transcriptions for the whole collection. The basis for ASR is the open-source framework CMU Sphinx-4 [17], a widely used, portable and flexible ASR system. The main components of the CMU Sphinx-4 transcriber are: a) a phonetic dictionary, which maps words to phones, the basic units of speech; b) an acoustic model, which contains the acoustic properties of each unit of speech; and c) a language model, which provides word-level language structure by defining which words can follow previously recognized ones, significantly restricting the matching process by pruning improbable words. Existing open-source language and acoustic models are used in the context of the VERGE platform. The a priori extracted transcriptions, together with the provided metadata, are then fed into a text-based search module that uses Apache Lucene and enables the identification of a video by using words from its plot.
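The transcript search can be illustrated with a toy inverted index; VERGE relies on Apache Lucene for this, so the snippet below only conveys the idea, and the sample transcripts are invented:

```python
from collections import defaultdict

def build_index(transcripts):
    """Build a toy inverted index mapping each word to the set of video ids
    whose transcript contains it. transcripts: dict of video_id -> text."""
    index = defaultdict(set)
    for vid, text in transcripts.items():
        for word in text.lower().split():
            index[word].add(vid)
    return index

def search(index, query):
    """Return the video ids whose transcript contains every query word."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

transcripts = {
    "v1": "the couple walked along the beach",
    "v2": "a storm hit the coast",
}
idx = build_index(transcripts)
print(search(idx, "the couple"))  # → {'v1'}
```

Lucene adds tokenisation, stemming and relevance ranking on top of this basic idea.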

Video Captioning and Caption-Based Search
This module describes each video with a sentence/caption constructed from words of a vocabulary, so that the user can retrieve videos through simple text search. Video captioning approaches comprise two separate components: i) a feature extractor that typically samples the frames of a video at a fixed step and extracts their features, and ii) an encoder-decoder that encodes the content and subsequently maps it to words. To this end, an RNN-based neural network similar to [18] is used. The model is pre-trained on MSR-VTT [16], a widely-known dataset in the video captioning domain. Finally, an approach based on [19] that uses reinforcement learning is implemented.
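The fixed-step frame sampling performed by the feature extractor can be sketched as follows; the step value is an illustrative assumption, as the paper does not specify it:

```python
def sample_frames(n_frames, step=16):
    """Return the indices of the frames to feed to the feature extractor,
    sampled at a fixed step across the video (step=16 is only an example)."""
    return list(range(0, n_frames, step))

# A 100-frame clip sampled every 30 frames yields 4 keyframe indices
print(sample_frames(100, step=30))  # → [0, 30, 60, 90]
```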

Multimodal Fusion and Temporal Search
This module fuses the results of two or more search modules, such as the visual descriptors (Section 2.1), the concepts (Section 2.2) and the color features mentioned in Section 3. Similar shots are retrieved by performing center-to-center comparisons among keyframes using the selected modules. The query is described with multiple features (e.g. a shot, a color and/or concepts), and one of the features is designated by the user as dominant (i.e. the most important one). The system returns the top-N relevant shots by considering solely the dominant feature (e.g. color), and then the other features are used for re-ranking the initial list by means of a non-linear graph-based fusion method [20]. To perform temporal search, a query using multiple features of two adjacent shots is received, the top-N relevant images for one of the query shots are retrieved, and finally this list is re-ranked by considering the features of the adjacent shot.
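The retrieve-then-re-rank flow can be sketched as below; note that the weighted average only stands in for the non-linear graph-based fusion of [20], and all identifiers and scores are invented for illustration:

```python
def fuse(dominant_scores, other_scores, top_n=3, weight=0.5):
    """Two-stage fusion sketch: retrieve the top-N shots using the dominant
    feature only, then re-rank that short list with the remaining features.
    All score dicts map shot_id -> similarity (higher is better)."""
    # Stage 1: rank by the dominant feature (e.g. color) and keep the top-N
    top = sorted(dominant_scores, key=dominant_scores.get, reverse=True)[:top_n]

    # Stage 2: fold the other features into the score of each short-listed shot
    def combined(shot):
        others = sum(s.get(shot, 0.0) for s in other_scores) / len(other_scores)
        return weight * dominant_scores[shot] + (1 - weight) * others

    return sorted(top, key=combined, reverse=True)

color_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}   # dominant feature
concept_scores = {"a": 0.1, "b": 0.9, "c": 0.5}            # secondary feature
print(fuse(color_scores, [concept_scores]))  # → ['b', 'c', 'a']
```

Shot "d" never reaches the re-ranking stage because the dominant feature alone decides the initial top-N list.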

VERGE User Interface and Interaction Modes
The VERGE web application (Fig. 2) aims to provide end users with a friendly and effective way to utilise the developed retrieval algorithms in a modern environment. Since we decided to incorporate a large number of modalities this year in order to offer more search options, our main goal is to serve them to the user in a simple, uncluttered way.
The VERGE user interface consists of three principal components: (i) a dashboard menu on the left, (ii) a results panel that covers most of the screen, and (iii) a filmstrip at the bottom. The menu contains a countdown timer that shows the remaining time to submit during VBS, a slider that adjusts the size of the results, a back button that restores the outcomes of previous queries, and a switch button that defines whether a retrieval module will bring new matching shots or re-rank the existing results. Next, the various search modules are visualised as boxes that can be expanded or collapsed for reasons of compactness. In detail, Concepts and Filters present the entire list of visual concepts and filters, respectively (Section 2.2), and both provide an auto-complete search option; the selection of multiple concepts is also supported. Colors offers a color palette for retrieving images of a specific shade. Text Search looks for the typed words in the video metadata, in the speech-to-text transcriptions (Section 2.4), and/or in the summaries described in Section 2.5, and it can also map the words to visual concepts and return the most relevant shots (Section 2.3). Furthermore, Combination allows two or more of the above modalities to be fused into a single query (Section 2.6).

To illustrate the capabilities of VERGE, a simple scenario is described, where users try to find shots of a couple hugging in a black-and-white movie. Search can be initiated by applying the "B/W" filter from the available list of filters and then combining it with the concept "two people". Once a relevant image appears among the results, visual similarity can be used to retrieve more similar shots. An alternative strategy is to look for relevant keywords (e.g., "couple old movie") inside the metadata, the transcripts and the video summaries.

Future Work
Since some of the aforementioned retrieval modalities are introduced into VERGE for the first time, we will evaluate their performance during the VBS contest and decide accordingly on their further enhancement or modification.