Concept-Based and Event-Based Video Search in Large Video Collections

10.1002/9781119376996.ch2 https://zenodo.org/records/2649178 oai:zenodo.org:2649178 Foteini Markatopoulou Foteini Markatopoulou CERTH Damianos Galanopoulos Damianos Galanopoulos CERTH Christos Tselepis Christos Tselepis CERTH Vasileios Mezaris Vasileios Mezaris CERTH Ioannis Patras Ioannis Patras Queen Mary University Concept-Based and Event-Based Video Search in Large Video Collections Wiley 2019 concept-based search event-based search video concept detection event detection 2019-03-15 2021-03-15 https://zenodo.org/communities/moving-h2020 https://zenodo.org/communities/invid-h2020 https://zenodo.org/communities/emma-h2020 Creative Commons Attribution 4.0 International Video content can be annotated with semantic information such as simple concept labels that may refer to objects (e.g., “car” and “chair”), activities (e.g., “running” and “dancing”), scenes (e.g., “hills” and “beach”), etc.; or more complex (or highlevel) events that describe the main action that takes place in the complete video. An event may refer to complex activities, occurring at specific places and times, which involve people interacting with other people and/or object(s), such as “changing a vehicle tire”, “making a cake”, or “attempting a bike trick”, etc. Concept-based and event-based video search refers to the retrieval of videos/video fragments (e.g., keyframes) that present specific simple concept labels or more complex events from large-scale video collections, respectively. To deal with concept-based video search, video concept detection methods have been developed that automatically annotate video-fragments with semantic labels (concepts). Then, given a specific concept a ranking component retrieves the top related video fragments for this concept. While significant progress has been made during the last years in video concept detection, it continues to be a difficult and challenging task. This is due to the diversity in form and appearance exhibited by the majority of semantic concepts and the difficulty to express them using a finite number of representations. A recent trend is to learn features directly from the raw keyframe pixels using deep convolutional neural networks (DCNNs). Other studies focus on combining many different video representations in order to capture different perspectives of the visual information. Finally, there are studies that focus on multi-task learning in order to exploit concept model sharing, and methods that look for existing semantic relations e.g., concept correlations. In contrast to concept detection, where we most often can use annotated training data for learning the detectors, in the problem of video event detection we can distinguish two different but equally important cases: when a number of positive examples, or no positive examples at all (“zero-example” case), are available for training. In the first case, a typical video event detection framework includes a feature extraction and a classification stage, where an event detector is learned by training one or more classifiers for each event class using available features (sometimes similarly to the learning of concept detectors), usually followed by a fusion approach in order to combine different modalities. In the latter case, where solely a textual description is available for each event class, the research community has directed its efforts towards effectively combining textual and visual analysis techniques, such as using text analysis techniques, exploiting large sets of DCNN-based concept detectors and using various re-ranking methods, such as pseudo-relevance feedback, or self-paced re-ranking. In this chapter, we survey the literature and we present our research efforts towards improving concept- and event-based video search. For concept-based video search, we focus on i) feature extraction using hand-crafted and DCNN-based descriptors, ii) dimensionality reduction using accelerated generalised subclass discriminant analysis (AGSDA), iii) cascades of hand-crafted and DCNN-based descriptors, iv) multi-task learning (MTL) to exploit model sharing and v) stacking architectures to exploit concept relations. For video event detection, we focus on methods which exploit positive examples, when available, again using DCNN-based features and AGSDA, and we also develop a framework for zero-example event detection that associates the textual description of an event class with the available visual concepts in order to identify the most relevant concepts regarding the event class. Additionally, we present a pseudorelevant feedback mechanism that relies on AGSDA.