Concept Language Models and Event-based Concept Number Selection for Zero-example Event Detection

Zero-example event detection is a problem where, given an event query as input but no example videos for training a detector, the system retrieves the most closely related videos. In this paper we present a fully-automatic zero-example event detection method that is based on translating the event description to a predefined set of concepts for which previously trained visual concept detectors are available. We adopt the use of Concept Language Models (CLMs), which is a method of augmenting semantic concept definition, and we propose a new concept-selection method for deciding on the appropriate number of the concepts needed to describe an event query. The proposed system achieves state-of-the-art performance in automatic zero-example event detection.


INTRODUCTION
Multimedia-event detection is a very important task that deals with automatically detecting the main event presented in a video. As a video event we consider a complex activity involving people interacting with other people and/or objects, e.g., "Renovating a home". Typically, multi-class classification is used to train event detectors on ground-truth annotated video samples. However, collecting ground-truth annotated data is difficult and time consuming. As a result, the more practically applicable but also more challenging zero-example event detection task has gained significant attention. The objective of this task is to retrieve the most closely related videos from a large video collection, given any abstract event description for which training samples are not available.
Recent studies typically start by analysing the textual event description so as to transform it into a meaningful set of keywords. At the same time, a predefined set of concepts is used, on the one hand to find which of these concepts are related to the extracted keywords and consequently to the event description, and on the other hand to train visual concept detectors that will be used to annotate the videos with these semantic concepts. The distance between the event's concept vector and each video's concept vector is calculated, and the videos with the smallest distance are selected as being the most closely related to the given event. In this work we improve such a typical system in the following ways: i) we adopt an efficient way of augmenting the definition of each semantic concept in the concept pool, ii) we present a new strategy for deciding on the appropriate number of concepts for representing the event query, and iii) we combine these in a zero-example event detection method that outperforms the state-of-the-art techniques.

RELATED WORK
Zero-example event detection is an active topic, with many works in the literature proposing ways to build event detectors without any training samples, using solely the event's textual description. Research on this problem was mainly triggered a few years ago, when the TRECVID benchmark activity introduced the 0Ex task as a subtask of the Media Event Detection (MED) task [9]. A problem similar to zero-example event detection, known as zero-shot learning (ZSL), also appears in image recognition, where a new unseen category, for which training data is not available, must be detected in images [3,7,12]. It should be noted that although the two problems have many common properties, zero-example event detection is more challenging, as it focuses on more complex queries, where multiple actions, objects and persons interact with each other, compared to the simple object or animal classes that appear in ZSL [19].
The problem of zero-example event detection is typically addressed by transforming both the event textual description and the available videos into concept-based representations. Specifically, a large pool of concept detectors is used to annotate the videos with semantic concepts. The resulting vectors, a.k.a. model vectors, contain scores indicating the degree to which each of the concepts is related to the video. The query description is analysed and the most related concepts from the pool are selected. Finally, the distance between the model vectors and the event concept vectors is calculated and the most related videos are retrieved [1,4,18,21].
Concept detectors are typically trained on external ground-truth annotated datasets using, for example, deep networks (DCNNs) or low-level features from different modalities [10]. The simplest way of translating an event query into keywords is to space separate the event's textual description, remove the stop-words and apply simple NLP rules [16]. Then, each of the keywords is compared with each of the concepts and the top-related concepts are selected to represent the event. Typically, a fixed number is used to decide how many concepts will be selected for each event query [2,18]. However, adjusting the number of concepts based on the textual description has not been investigated. A semi-automatic approach is proposed in [2]: initially, the system automatically detects the concepts that are related to an event query; subsequently, a human manually removes the noisy ones. The authors argue for the importance of such human intervention, due to the big influence that the selection of wrong concepts for an input query has on the system's accuracy. Furthermore, comparing each keyword with a single concept may be suboptimal. In some works the augmentation of concepts with synonyms is proposed, while the authors of [18] proposed a method where Concept Language Models are built using online sources, such as Google and Wikipedia, for augmenting the concept definitions with more information. In [20], logical operators are used to discover different types of composite concepts, which leads to better event detection performance. Nevertheless, more clever ways of augmenting the concept pool should be found. In [4], instead of calculating concept-related event and video vectors, both the videos and the event queries are embedded into a distributional semantic space, and then the similarity between these representations is measured. Xiaojun et al. [11] proposed a zero-example event detection method which initially learns a skip-gram model in order to find the semantic correlation between the event description and the vocabulary of concepts. Then, external video resources are retrieved and dynamic composition is used in order to calculate the optimal concept weights that will be aligned with each testing video based on the target event. Although this approach presents very promising results, the retrieval of external videos and the concept weight calculation are computationally expensive. Finally, the works in [14] and [13] focus on improving the system's retrieval accuracy by using pseudo-relevance feedback.

PROPOSED APPROACH
In this section we present a fully automatic zero-example event detection system, as illustrated in Fig. 1. The proposed system takes as input the event kit, i.e., a textual description of the event query, and retrieves the most related videos from the available event collection. An example of an event kit is presented in Fig. 2. As can be seen, it is a textual description of the requested event that includes the event's title, a short definition of the event, and visual and audio cues that are expected to appear in those videos that contain this event. The complete procedure is split into two major components. The first component (upper part of Fig. 1) builds the event detector, i.e., a vector of the most related concepts based on the event kit. The second component (lower part of Fig. 1) calculates the video model vectors, i.e., annotates the videos with semantic concepts. Finally, the outputs of the two components are compared using a similarity measure, and the videos whose model vectors are closest to the event detector are retrieved.

Building an Event Detector
An event detector is a k-element vector d of the concepts most related to the event query. Each element indicates the degree to which the corresponding one of the k concepts is related to the target event query. To calculate the event detector we propose the following method (Algorithm 1). Firstly, we check if the entire event title is semantically close to any of the available concepts from the concept pool, i.e., we check if the semantic relatedness between the event title and each of the concepts is above a (rather high) threshold. If so, we consider that the event is well described entirely by this concept (or these concepts), and the corresponding relatedness values are used to form the event detector d. The Explicit Semantic Analysis (ESA) measure [17] is used to calculate the semantic relatedness of two words or phrases. The ESA measure calculates the similarity between two terms by computing the cosine similarity between their corresponding weighted vectors of Wikipedia articles. We choose this measure because it is capable of handling more complex phrases and not only single words. If the above process does not detect any related concepts, then an Event Language Model (ELM) and Concept Language Models (CLMs) are built as follows.
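The title-matching step can be illustrated with a minimal sketch. It assumes that the ESA vectors of the event title and of each concept are already available as sparse dictionaries mapping a Wikipedia article to a weight; how these vectors are obtained is outside the sketch. The function names and the threshold value of 0.8 are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors given as {key: weight} dicts."""
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


def title_matched_concepts(title_vec, concept_vecs, threshold=0.8):
    """Return {concept: relatedness} for concepts whose relatedness to the
    event title is above the threshold (0.8 is an assumed, illustrative value)."""
    matched = {}
    for name, vec in concept_vecs.items():
        score = cosine_similarity(title_vec, vec)
        if score > threshold:
            matched[name] = score
    return matched
```

If the returned dictionary is non-empty, its values would form the event detector directly; otherwise the language-model path is followed.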
(a) Event Language Model (ELM) and Concept Language Model (CLM). An ELM is a set of N words and phrases that are extracted from the event kit. We build an ELM using the event title and the visual and audio cues, by simply space separating them. A CLM is a set of M words or phrases that are extracted w.r.t. a specific concept definition. A CLM is built for each concept using the top articles in Wikipedia. The retrieved articles are transformed into a Bag-of-Words (BoW) representation, from which the top-M words, which are the most characteristic words of this particular visual concept, are kept. For example, the top retrieved words for the concept "palace" are "palace", "crystal", "theatre", "season", "west", "east", "spanish", "gates", "hotel". [...] where d_j = [W_{1,j}, W_{2,j}, ..., W_{M,j}]. The single values calculated per concept, by repeating the above process for every CLM, are concatenated into a single k′-element vector d′, and a process is followed for deciding the appropriate number of concepts that will finally be kept for representing the event query.
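The construction of the two language models in (a) can be sketched as follows. This is only an illustration under stated assumptions: the articles are passed in as plain strings (their retrieval from Wikipedia is not shown), "most characteristic" is approximated by raw term frequency after stop-word removal, and the stop-word list is a tiny hypothetical one. The function names are likewise illustrative.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to"}


def build_elm(title, visual_cues, audio_cues):
    """ELM: the distinct space-separated words of the title and the cues,
    lower-cased and kept in order of first appearance."""
    words = []
    for part in [title, *visual_cues, *audio_cues]:
        words.extend(part.lower().split())
    return list(dict.fromkeys(words))


def build_clm(articles, top_m=10):
    """CLM: the top-M most frequent non-stop-words over the given article texts.
    Ranking by raw frequency is an assumed stand-in for 'most characteristic'."""
    counts = Counter()
    for article in articles:
        tokens = re.findall(r"[a-z]+", article.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_m)]
```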
(b) Event-based concept number selection. In contrast to [18] and [2], where the number of selected concepts is fixed across the different events, we propose a statistical strategy that decides on the appropriate number of concepts k, where k ≤ k′, that should be kept for an event query. The strategy is motivated by statistical methods such as PCA [15], where a fraction of the components is enough to describe the data efficiently, or even better. Specifically, our strategy sorts the vector of concept scores d′ in descending order, constructs an exponential curve, and then selects the first k concepts so that the corresponding area under the curve is X% of the total area under the curve. Consequently, this procedure returns a different number of selected concepts for different target events. For example, for the event "Attempting a bike trick" the selected concepts are the following four: "ride a dirt bike", "mountain biking", "put on a bicycle chain", "ride a bicycle", while for the event "Cleaning an appliance" only the concept "clean appliance" is selected. The final event detector is a k-element vector that contains the relatedness scores of the selected concepts.
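The selection rule in (b) can be illustrated with a short sketch. It makes two simplifying assumptions that are not taken from the paper: the area under the curve is approximated by the running sum of the sorted scores (no exponential curve is fitted), and all scores are non-negative. The function name and the area_fraction parameter (X expressed as a fraction) are hypothetical.

```python
def select_concepts(scores, area_fraction):
    """Keep the top-k concepts whose cumulative score first reaches
    area_fraction of the total score.

    scores: {concept_name: relatedness}, assumed non-negative.
    area_fraction: X% expressed as a value in (0, 1].
    Returns a list of (concept_name, score) pairs in descending score order.
    """
    if not 0.0 < area_fraction <= 1.0:
        raise ValueError("area_fraction must be in (0, 1]")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    total = sum(score for _, score in ranked)
    if total <= 0.0:
        return ranked[:1]  # degenerate case: keep at most the top concept
    target = area_fraction * total
    selected, cumulative = [], 0.0
    for name, score in ranked:
        selected.append((name, score))
        cumulative += score
        if cumulative >= target:
            break
    return selected
```

Because the cut-off depends on how quickly the sorted scores decay, an event with one dominant concept yields a short detector, while an event with many similarly scored concepts yields a longer one.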

Video Annotation and Retrieval
Initially, each video is decoded into a set of keyframes at fixed temporal intervals. Then, a set of pre-trained concept-based DCNNs is applied to every keyframe, and each keyframe is represented by the direct output of those networks. Finally, a video model vector is computed by averaging (in terms of arithmetic mean) the corresponding keyframe-level representations. Each element of a model vector indicates the degree to which the corresponding predefined concept appears in the video.
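The averaging step can be sketched in a few lines. The sketch assumes the keyframe-level DCNN outputs are already available as equal-length lists of concept scores; the function name is illustrative.

```python
def video_model_vector(keyframe_scores):
    """Arithmetic mean of the keyframe-level concept score vectors.

    keyframe_scores: non-empty list of equal-length lists, one per keyframe.
    """
    if not keyframe_scores:
        raise ValueError("at least one keyframe is required")
    dim = len(keyframe_scores[0])
    if any(len(frame) != dim for frame in keyframe_scores):
        raise ValueError("all keyframe vectors must have the same length")
    n = len(keyframe_scores)
    return [sum(frame[i] for frame in keyframe_scores) / n for i in range(dim)]
```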
The distance between an event detector and each of the video-level model vectors is calculated, and the h videos with the smallest distance are retrieved. As the distance measure we choose the histogram intersection, which calculates the similarity of two discretized probability distributions and is defined as follows:
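The defining formula does not appear in the text above. A minimal sketch, assuming the standard definition of histogram intersection (the sum of the element-wise minima of the two vectors), is given below. Two further assumptions are made: each model vector has already been restricted to the same k concepts as the event detector, and, since a larger intersection means a closer match, the videos are ranked by descending intersection. The function names are hypothetical.

```python
def histogram_intersection(p, q):
    """Standard histogram intersection: sum of element-wise minima."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return sum(min(a, b) for a, b in zip(p, q))


def retrieve(event_detector, model_vectors, h):
    """Return the ids of the h videos whose model vectors intersect most
    with the event detector.

    model_vectors: {video_id: vector over the same k concepts as the detector}.
    """
    ranked = sorted(
        model_vectors,
        key=lambda vid: histogram_intersection(event_detector, model_vectors[vid]),
        reverse=True,
    )
    return ranked[:h]
```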

EXPERIMENTAL RESULTS
We use the TRECVID MED14TEST dataset [9], which contains approximately 25,000 videos. We evaluate all the methods on the 20 MED2016 [5] Pre-Specified events (E021-E040) for which event kits are provided. We use a concept pool that consists of 13,488 semantic concepts collected from two different sets: i) 12,988 concepts from the ImageNet "fall" 2011 dataset [8] and ii) 500 high-level concepts from the EventNet [6] dataset. Each video was decoded into 2 keyframes per second and each keyframe was annotated with all the above concepts. In order to obtain scores for the 12,988 ImageNet concepts we use the pre-trained GoogLeNet provided by [22], while we use the EventNet [6] [...]
Table 2: Comparison between different types of CLM
[...] and the third one uses the event title, visual and audio-visual cues. According to Table 1, the more information is given for building an ELM the better the overall accuracy, i.e., the third ELM, which uses the complete event kit description (except for the event definition), outperforms the other two, which use sub-parts of the event kit.
Similarly, in Table 2 and Figure 3 we compare two different types of CLMs. The first CLM uses solely the concept name, along with any available description of it. The second CLM augments the concept name with terms from Wikipedia as described in Section 3.1; the top-10 words of the BoW representation from the top-10 retrieved documents are used. Similar to the ELMs, the more information is provided for building a CLM the better the overall accuracy, i.e., augmenting a concept with information captured from Wikipedia improves the video detection performance. We noticed that 7 out of 20 events had the same performance irrespective of the CLM type used. This happened because existing concepts in our initial pool can adequately describe these specific events.
Table 3: Comparison between different zero-example event detection systems
[...] top-k selected concepts in every event), affects the performance of our method. In Fig. 4 we observe that the best AP for the event "Horse riding competition" is achieved for small values of the AUC. This indicates that selecting more concepts that are not highly related to the event query adds noise to the retrieval process, which consequently reduces the overall accuracy. Similar conclusions for the overall performance can be reached w.r.t. Fig. 5. The best performance is achieved when 1% of the AUC is chosen. In this case the average number of selected concepts is 20.1, but each event has a different number of concepts. For example, the event "Attempting a bike trick" needs only 4 concepts, while the event "Winning a race without a vehicle" needs 38 concepts.
In our second set of experiments, we compare the proposed method with the following three state-of-the-art ones: i) the AutoSQGSys system [13], ii) the Concept Bank system [1] and iii) the zero-example method of Tzelepis et al. [18], where a fixed number of selected concepts was used. The results of [13] and [1] are taken from the corresponding papers, while the method of [18] was re-implemented in order to be suitable for our experimental set-up. According to Table 3, the proposed method outperforms all of the other approaches, reaching a MAP of 0.133.
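For reference, the AP and MAP figures reported here can be illustrated with a minimal sketch that assumes the standard non-interpolated definition of average precision; the exact evaluation tooling used in the experiments is not described in the text, and the function names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision of one ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, video_id in enumerate(ranked_ids, start=1):
        if video_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the per-event APs; runs is a list of (ranked_ids, relevant_ids)."""
    if not runs:
        raise ValueError("at least one event is required")
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```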

CONCLUSION
In this paper we present a fully-automatic method for zero-example video event detection. The augmentation of the concept descriptions with extra information in combination with the proposed strategy for deciding on the appropriate number of concepts for representing the event query outperforms all the state-of-the-art approaches presented in this paper.