ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-up Recognition and Explanation of Events in Video

In this paper a novel bottom-up video event recognition approach is proposed, ObjectGraphs, which utilizes a rich frame representation and the relations between objects within each frame. Following the application of an object detector (OD) on the frames, graphs are used to model the object relations and a graph convolutional network (GCN) is utilized to perform reasoning on the graphs. The resulting object-based frame-level features are then forwarded to a long short-term memory (LSTM) network for video event recognition. Moreover, the weighted in-degrees (WiDs) derived from the graph's adjacency matrix at frame level are used for identifying the objects that were considered most (or least) salient for event recognition and contributed the most (or least) to the final event recognition decision, thus providing an explanation for the latter. The experimental results show that the proposed method achieves state-of-the-art performance on the publicly available FCVID and YLI-MED datasets.


Introduction
The recognition of high-level events in unconstrained videos is one of the major research topics in multimedia understanding. Adopting popular definitions in the literature [3], a high-level event is a long-term spatially and temporally dynamic activity, e.g. a "birthday party", encompassing multiple objects or actions [23,13], e.g., visitors, birthday cake and dancing stage, that are loosely organized spatially and temporally. This definition clearly differentiates the above research domain from the human action recognition one, which deals with the recognition of fine-grained elementary actions of a human being by capturing the subtle differences between similar actions, e.g., "running" and "walking" or "drinking beer" and "drinking wine".
A key element of event recognition approaches is the method used to extract the features for representing the video. According to this, the various approaches can be categorized as follows. i) Handcrafted: Mostly older methods using low-level features, e.g. improved dense trajectories [25]. ii) C2D: Techniques that utilize deep convolutional neural networks (DCNNs) with 2D convolutional kernels to extract the static event-related information at frame-level, and subsequently utilize an appropriate technique to capture the temporal dynamics of the event [26,22,15,32,30,31,17,19]. iii) C3D: DCNNs that use 3D convolutional kernels to encode simultaneously the spatiotemporal event information in videos [24,28,8].
The two latter categories described above have shown superior event recognition performance due to the ability of DCNNs to extract features that separate well the different event classes. The majority of them operate directly on the overall video frame (C2D) or the entire video (C3D) in a top-down manner, i.e., they utilize a single event label for each video through a cross-entropy loss function to learn to focus implicitly on the video regions that are most related to the specified event. However, in this way they fail to fully exploit the discriminant information carried by the multiple semantic entities related to the underlying event, as well as to provide a human-understandable explanation of their event classification decisions.
The above limitations can be alleviated by either utilizing a suitable top-down approach and an appropriate video dataset for holistic representation learning [5], or by employing a bottom-up mechanism to attain a rich representation of the video content at each frame. To this end, inspired by recent advances in other video understanding domains [1,28,13], we follow the latter direction. Firstly, an object detector (OD) [20,1] and a graph convolutional network (GCN) [16] are utilized to derive a feature vector representation for the objects most likely depicted in the frame as well as the relationships among them, thus obtaining an object-level representation of the video content at each frame. Subsequently, a long short-term memory (LSTM) network [11] is used to encode the temporal dynamics of the frame representations and recognize the underlying event. Furthermore, during the testing phase, the weighted in-degrees (WiDs) of the graph vertices are used to identify the objects and their regions at frame- and video-level that mostly contributed to recognizing the event. In this way, our approach effectively provides a recounting of the recognized event: an object-grounded explanation of the model's outcome [10,7]. The proposed approach is evaluated on two publicly available datasets, namely FCVID [14] and YLI-MED [3], producing state-of-the-art results. In summary, the main contributions of the paper are the following:
• We present a bottom-up video event recognition approach that, by effectively combining relevant deep learning technologies (OD, GCN and LSTM), identifies and exploits the objects appearing in the video and their semantic relations.
• We utilize the WiDs of the derived graph's adjacency matrix in order to provide a recounting of the recognized event consisting of its key semantic entities (objects) at frame-and video-level.
The paper is structured as follows: related work is discussed in Section 2; the proposed method is presented and evaluated in Sections 3 and 4, respectively; conclusions are drawn in Section 5.

Related work
Early event recognition methods used hand-crafted features with quite good results [25]. However, over the last years, DCNN-based approaches have dominated this domain due to their groundbreaking performance in a variety of tasks. The C2D approaches extract 2D spatial convolutional features and model the temporal dimension independently. For instance, in [26] short snippets are extracted, modeling the long-range temporal structure of the video more effectively. Spatiotemporal VLAD (ST-VLAD) is presented in [22], encoding convolutional features across different segments to represent the video. In [15], PivotCorrNN is proposed, exploiting correlations among different video modalities. S2L is introduced in [32], utilizing a pretrained ResNet and an LSTM to model separately the spatial and temporal video information. LiteEval in [30] uses a coarse and a fine LSTM operating cooperatively through a conditional gating module. In [31], AdaFrame exploits a policy gradient method to select future frames for faster and more accurate video predictions. In [17], SCSampler uses a lightweight saliency model to select the most salient temporal clips within a long video. In [19], the adaptive resolution network (AR-Net) selects on-the-fly the optimal frame resolution for classifying the video, outperforming the other methods on the FCVID dataset. In contrast to C2D approaches, C3D ones learn the space and time information jointly by exploiting 3D convolutions. For instance, in [24], convolutional 3D features are exploited by a linear support vector machine (C3D+LSVM) for video classification. In [8], the large-scale Kinetics dataset is used to derive 3D-CNNs of high depth for transfer learning applications.
The above methods learn to recognize a specified event in a top-down manner, i.e. a single event label is used for implicitly teaching the deep neural network to focus on the most salient features for the specified event in the video. A major drawback of this approach is that discriminant information contained in the multitude of semantic entities appearing in a video may not be fully exploited for recognizing the underlying event. Recently, the utilization of a bottom-up mechanism to provide a richer representation of the video content has been explored in the domains of visual question answering [1] and action recognition [28,13]. More specifically, in [1], a bottom-up mechanism is implemented using a Faster R-CNN [20], resulting in improved image captioning. In [13], Faster R-CNN and RelDN [34] are used to extract objects and visual relationships, and construct spatiotemporal scene graphs [29] for action recognition. In [28], a 3D-ResNet backbone is combined with Faster R-CNN and a GCN to represent videos as space-time region graphs for the classification of elementary actions. Inspired by the above works, a bottom-up event recognition approach is proposed here, which, contrary to [28], utilizes a Faster R-CNN with ResNet-101 backbone, a pretrained ResNet-152 as feature extractor and a GCN to derive an object graph for each frame. That is, we represent each video as a sequence of graphs instead of a space-time region graph because, despite the fact that C3D approaches have shown promising performance in the recognition of elementary human actions, recent studies suggest that C2D methods can better encode the long-term dependencies and compositional nature of complex high-level events [12]. This is due to the very different nature of the two problems: for instance, only subtle differences may be observed in subsequent video frames depicting a "short-term" human action (e.g. "lifting the telephone"), while in event videos such differences may be dramatic, and thus difficult to capture by applying a C3D to the entire video. Furthermore, during the testing phase, the use of a C2D backbone network (instead of C3D) allows the association of each frame with a graph and subsequently the utilization of a mechanism (i.e. the computation of the WiDs of the derived graph's adjacency matrix) to derive the most salient objects in the frame related with the recognized event.

Problem formulation
Suppose an annotated training dataset of N videos and C event classes. Keyframe sampling is performed to obtain a sequence of Q frames for each video, and an OD is used to detect K objects at each frame, representing each object with its label, a bounding box (BB), a feature vector and a degree of confidence (DoC) value. Based on this formulation, the overall dataset can be described as

$\{(\{X^{(i,j)}\}_{j=1}^{Q}, y_i)\}_{i=1}^{N}$,   (1)

$X^{(i,j)} = [x_1^{(i,j)}, \ldots, x_K^{(i,j)}]^T \in \mathbb{R}^{K \times F}$,   (2)

where $y_i \in [1, \ldots, C]$ is the event class label, $X^{(i,j)}$ represents the $j$th frame of the $i$th video, $x_k^{(i,j)} \in \mathbb{R}^F$ is the feature vector representation of the $k$th object detected by the OD at frame $(i, j)$, the feature vectors in the rows of $X^{(i,j)}$ are sorted in descending order based on their DoC value, $F$ is the dimensionality of the feature space $\mathbb{R}^F$, $u_k^{(i,j)} \in [1, \ldots, P]$ is the object class label of the $k$th detected object, and $P$ is the number of object classes. Given the above formulation, a network architecture combining a GCN and an LSTM structure is used to learn the spatiotemporal dynamics of high-level events. The overall architecture is shown in Fig. 1 and explained in detail in the next subsections.
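To make the notation concrete, the following minimal Python sketch builds a toy dataset with this structure. All sizes and variable names here are ours, chosen purely for illustration (the paper uses Q = 9, K = 50, F = 2048):

```python
import numpy as np

# Toy sizes for illustration only (the paper uses Q=9, K=50, F=2048).
N, C, Q, K, F = 4, 3, 9, 5, 8
P = 100  # number of object classes

rng = np.random.default_rng(0)
dataset = []
for i in range(N):
    y_i = int(rng.integers(1, C + 1))        # event class label in [1, ..., C]
    frames = []
    for j in range(Q):
        doc = np.sort(rng.random(K))[::-1]   # DoC values, sorted descending
        X_ij = rng.standard_normal((K, F))   # rows: object feature vectors x_k
        u_ij = rng.integers(1, P + 1, K)     # object class labels u_k
        frames.append({"X": X_ij, "doc": doc, "labels": u_ij})
    dataset.append((frames, y_i))
```

Each video thus becomes a length-Q sequence of K×F feature matrices, which is the input consumed by the GCN/LSTM architecture described next.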

Object detector
In order to obtain a precise representation of the underlying event at each frame, we need to focus on the frame regions conveying the high-level semantic information that can help us recognize the event, and discard the noisy or irrelevant frame parts. To this end, a bottom-up procedure is adopted in order to obtain K objects at each frame [1]. Specifically, a ResNet-101 [9] pretrained and fine-tuned on the ImageNet1K [21] and Visual Genome [18] datasets, respectively, is used as the backbone network of a Faster R-CNN architecture [20]. Applying the latter to an input frame (i, j) means that a convolutional feature map output is derived, region proposals are produced using the region proposal network (RPN) module sliding over the obtained feature map, and several BBs with corresponding DoC values at multiple scales and aspect ratios are generated using the box-regression and -classification networks. All the BBs are sorted according to their DoC value, and a non-maximum suppression procedure with an intersection-over-union (IoU) threshold is applied to retrieve the top-K proposed regions along with the corresponding object class labels $u_k^{(i,j)}$, $k = 1, \ldots, K$. Subsequently, the region of interest (RoI) pooling layer is used to extract an H × H feature map and obtain the respective coordinates of the K regions in the input frame.
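The box-selection step described above can be sketched as a greedy non-maximum suppression over DoC-sorted boxes that keeps at most K regions. This is a simplified numpy illustration of the generic NMS procedure, not the Faster R-CNN internals; the function names are ours:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms_top_k(boxes, scores, k, iou_thr=0.7):
    """Greedy NMS over confidence-sorted boxes; keep at most k indices."""
    order = np.argsort(scores)[::-1]       # highest DoC first
    keep = []
    while order.size and len(keep) < k:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thr]  # drop overlaps
    return np.array(keep)
```

For example, two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives, which is exactly the behavior needed to retrieve K distinct object regions per frame.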
Each of the K regions derived above is fed to a feature extractor, retrieving a feature vector $x_k^{(i,j)}$ capturing the appearance of the $k$th object in the frame. As feature extractor we utilize the pool5 layer of a ResNet-152 trained on the ImageNet11K dataset [21].

Graph construction
The appearances of the objects in frame $(i, j)$ and the interrelations among them are encoded by constructing a directed graph $G^{(i,j)}(V^{(i,j)}, E^{(i,j)})$, where $V^{(i,j)}$ is the set of vertices and $E^{(i,j)}$ the set of edges. In order to avoid notation clutter, the superscript $(i, j)$ is dropped in the rest of this and the next subsection. In our setting, the vertices in $V$ are represented by the $K$ vectors associated with the objects in the frame, sorted in descending order according to their DoC values, $x_1, \ldots, x_K$, as explained in Eq. (2). A matrix $S \in \mathbb{R}^{K \times K}$ is then constructed using the following pairwise similarity measure [27,28]

$[S]_{l,k} = \tilde{v}_l^T \hat{v}_k$,   (3)

where $[S]_{l,k}$ is the element of $S$ in the $l$th row and $k$th column. In the expression above, similarly to [27,28], $\tilde{v}_l$ and $\hat{v}_k$ are derived using two different affine transformations of the object feature vectors,

$\tilde{v}_l = \tilde{W} x_l + \tilde{b}$,  $\hat{v}_k = \hat{W} x_k + \hat{b}$,   (4)

where $\tilde{W}, \hat{W} \in \mathbb{R}^{F \times F}$ and $\tilde{b}, \hat{b} \in \mathbb{R}^F$ are optimized during the network training procedure. The adjacency matrix $A \in \mathbb{R}^{K \times K}$ of the graph is then computed as [33]

$[A]_{l,k} = \dfrac{e^{[S]_{l,k}}}{\sum_{k=1}^{K} e^{[S]_{l,k}}}$,   (5)

i.e. the weighted out-degree of each vertex is normalized to one.
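The graph construction can be illustrated in a few lines of numpy. This is a toy sketch under our own naming, assuming a pairwise similarity of two learned affine transforms followed by a row-wise softmax; in the actual model the transform parameters are learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)
K, F = 5, 8                        # toy sizes; the paper uses K=50, F=2048

X = rng.standard_normal((K, F))    # rows: object feature vectors x_1..x_K
W_tilde = rng.standard_normal((F, F)); b_tilde = rng.standard_normal(F)
W_hat   = rng.standard_normal((F, F)); b_hat   = rng.standard_normal(F)

V_tilde = X @ W_tilde + b_tilde    # one affine transform per "side" of S
V_hat   = X @ W_hat + b_hat
S = V_tilde @ V_hat.T              # pairwise similarities

# Row-wise softmax: each vertex's outgoing edge weights sum to one.
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
```

The row-wise softmax is what makes the weighted out-degree of every vertex equal to one, so columns of A with large sums mark vertices that many other objects "attend" to.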

Graph convolutional network
An $M$-layer GCN is used to exploit the objects' information encoded in the frame-level graphs in order to learn discriminant graph embeddings for event recognition. Given the adjacency matrix $A$ (Eq. (5)), the $m$th graph convolutional layer is implemented as [28,16]

$X^{[m]} = \mathrm{LN}(\mathrm{ReLU}(A X^{[m-1]} W^{[m]}))$,   (6)

where $\mathrm{LN}()$ and $\mathrm{ReLU}()$ are the layer normalization [2] and rectified linear unit operators, $W^{[m]} \in \mathbb{R}^{F^{[m-1]} \times F^{[m]}}$ is the weight matrix at layer $m$, the rows of $X^{[m]} \in \mathbb{R}^{K \times F^{[m]}}$ are the hidden feature vectors corresponding to the $K$ objects in the frame, $X^{[0]}$ equals $X$ defined in Eq. (2), i.e. consists of the object feature vectors extracted using the OD, and $F^{[m]}$ is the dimensionality of the feature vectors at the $m$th layer.
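A minimal numpy sketch of this layer, assuming the update has the form LN(ReLU(A X W)) as in the graph layers of [28]. It is our own simplified illustration (random weights, toy sizes), not the trained model:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (feature vector) to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gcn_layer(A, X, W):
    """One graph-convolutional layer: X' = LN(ReLU(A X W))."""
    return layer_norm(np.maximum(A @ X @ W, 0.0))

rng = np.random.default_rng(0)
K, F = 5, 8
A = rng.random((K, K)); A /= A.sum(axis=1, keepdims=True)  # row-stochastic adjacency
X = rng.standard_normal((K, F))                            # object features (X^[0])
W1, W2 = rng.standard_normal((F, F)), rng.standard_normal((F, F))
H = gcn_layer(A, gcn_layer(A, X, W1), W2)                  # M = 2 stacked layers
```

Multiplying by A before the affine map is what mixes each object's features with those of its graph neighbors, weighted by the learned edge strengths.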

Event recognition
A single feature vector for the $(i, j)$ frame is obtained as explained in the following. The output of the GCN is passed through an average pooling layer, yielding a local feature vector $\tilde{z}^{(i,j)} \in \mathbb{R}^{F^{[M]}}$. A global feature vector $\hat{z}^{(i,j)} \in \mathbb{R}^F$ is also obtained by applying the feature extractor of the OD to the entire frame $(i, j)$ (i.e. the frame is fed to the ResNet-152 pretrained on the ImageNet11K dataset and the output of the pool5 layer is used to represent $\hat{z}^{(i,j)}$, similarly to what is described for specific regions in Section 3.2). The two feature vectors are then concatenated to form

$z^{(i,j)} = [\hat{z}^{(i,j)T}, \tilde{z}^{(i,j)T}]^T \in \mathbb{R}^{F + F^{[M]}}$,   (7)

encoding both the local and global frame information. Next, a standard LSTM layer [11] is utilized to capture the temporal dynamics of the event along the different frames,

$h^{(i,j)} = \mathrm{LSTM}(z^{(i,j)}, h^{(i,j-1)})$,

where $h^{(i,j)}$ is the hidden state vector. The hidden state vector $h^{(i,Q)}$ at the last time step of the video sequence is forwarded to a fully connected (FC) classification head (in our experiments we use two FC layers with an appropriate nonlinearity, i.e. softmax or sigmoid), providing a score value for each event in the dataset.
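The temporal modeling step can be sketched with a hand-rolled LSTM cell in numpy. This is purely illustrative (random stand-in features, toy sizes, our own names); the model itself uses a standard LSTM layer:

```python
import numpy as np

def lstm_step(z, h, c, Wx, Wh, b):
    """One LSTM step; the 4H pre-activations are stacked as gates i, f, g, o."""
    H = h.shape[0]
    a = Wx @ z + Wh @ h + b
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f = sig(a[:H]), sig(a[H:2 * H])       # input and forget gates
    g, o = np.tanh(a[2 * H:3 * H]), sig(a[3 * H:])  # candidate and output gate
    c = f * c + i * g
    return o * np.tanh(c), c

rng = np.random.default_rng(0)
F_loc, F_glob, H, Q = 8, 8, 16, 9            # toy dims; Q=9 frames as in the paper
Wx = rng.standard_normal((4 * H, F_loc + F_glob)) * 0.1
Wh = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for j in range(Q):                           # iterate over the Q sampled frames
    z_local = rng.standard_normal(F_loc)     # stand-in for avg-pooled GCN output
    z_global = rng.standard_normal(F_glob)   # stand-in for the whole-frame feature
    h, c = lstm_step(np.concatenate([z_global, z_local]), h, c, Wx, Wh, b)
# h now plays the role of h^(i,Q), the vector fed to the FC classification head.
```

Only the final hidden state is classified, so the LSTM must compress the whole frame sequence into that one vector.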

Explanation of event recognition results
During the forward signal propagation in the proposed network architecture, the adjacency matrix amplifies the contribution of the objects relevant to the event depicted in the scene and, on the contrary, attenuates the contribution of the irrelevant ones. To this end, during the testing phase, the adjacency matrix $A^{(i,j)}$ (Eq. (5)) associated with the frame $(i, j)$ is employed to derive the set of objects whose contribution to the feature vector $\tilde{z}^{(i,j)}$ was amplified and thus mostly contributed to the recognition of the specified event. Firstly, the WiD $\gamma_k^{(i,j)}$ of the $k$th graph vertex is computed as follows

$\gamma_k^{(i,j)} = \sum_{l=1}^{K} [A^{(i,j)}]_{l,k}$.   (8)

The computed $\gamma_k^{(i,j)}$ corresponds to the $k$th detected object and thus can be associated with its object class label $u_k^{(i,j)}$ (see Eq. (2)) and the respective BB. We treat $\gamma_k^{(i,j)}$ as an indicator of the contribution of the $k$th object in associating the frame $(i, j)$ with the recognized event. Therefore, these quantities can be used (e.g. by means of mean- or max-pooling) to provide some form of explanation for the event recognition result. Here, for each object class $p$ we compute the "average" WiDs at frame- and video-level, $\zeta_p^{(i,j)}$ and $\delta_p^{(i)}$ respectively, as shown below

$\zeta_p^{(i,j)} = \dfrac{1}{N_p^{(i,j)}} \sum_{k:\, u_k^{(i,j)} = p} \gamma_k^{(i,j)}$,   (9)

$\delta_p^{(i)} = \dfrac{1}{N_p^{(i)}} \sum_{j=1}^{Q} \sum_{k:\, u_k^{(i,j)} = p} \gamma_k^{(i,j)}$,   (10)

where $N_p^{(i,j)}$ and $N_p^{(i)}$ denote the number of objects belonging to class $p$ detected in frame $(i, j)$ and in the entire video $i$, respectively. Then, a set of indices $Z^{(i,j)}$ corresponding to the $\vartheta$ most salient objects at frame $(i, j)$ can be derived using

$Z^{(i,j)} = \mathrm{sort}_\vartheta(\zeta_1^{(i,j)}, \ldots, \zeta_P^{(i,j)})$,   (11)

where the $\mathrm{sort}_\vartheta$ operator returns the indices of the $\vartheta$ largest values in its input. Following a similar procedure, the $\vartheta$ largest $\delta_p^{(i)}$ (Eq. (10)) can be derived and used to obtain the most salient objects in video $i$.
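The WiD-based recounting can be sketched as follows. This is a numpy illustration with our own toy values: the WiD of a vertex is the column sum of the row-normalized adjacency matrix, per-class averages are taken over the objects of each class present in the frame, and the top classes are read off by sorting:

```python
import numpy as np

rng = np.random.default_rng(0)
K, theta = 6, 2
A = np.exp(rng.standard_normal((K, K)))
A /= A.sum(axis=1, keepdims=True)        # row-stochastic adjacency, as in Eq. (5)
u = np.array([0, 0, 1, 2, 2, 3])         # toy object class label per detected object

gamma = A.sum(axis=0)                    # WiD of each vertex: column sums of A

# "Average" WiD per object class present in the frame (zeta in the text).
zeta = {p: float(gamma[u == p].mean()) for p in np.unique(u)}

# Class indices of the theta most salient object classes in this frame.
top = sorted(zeta, key=zeta.get, reverse=True)[:theta]
```

A video-level recounting follows the same pattern, with the class-wise averages taken over all Q frames instead of a single one.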

Experimental evaluation 4.1. Datasets
We run experiments on two publicly available video datasets: i) FCVID [14] is a multilabel video dataset consisting of 91223 YouTube videos annotated according to 239 categories. It covers a wide range of topics, with the majority of them being real-world events such as "group dance", "horse riding", "birthday", "making cake" and others. The dataset is evenly split into training and testing partitions with 45611 and 45612 videos, respectively. Among them, 436 videos in the training partition and 424 videos in the testing partition were corrupted and thus could not be used. ii) YLI-MED [3] is a TRECVID-style video corpus based on YFCC100M, containing 1823 videos and 10 event categories. The dataset is divided into standard training and testing partitions of 1000 and 823 videos, respectively.

Setup
Uniform sampling is first applied to represent each video with a sequence of Q = 9 frames. Noting that in both datasets the videos' duration ranges from a few seconds to several minutes, this yields sparsely sampled video sequences. On the FCVID dataset, our model is trained as follows: The OD described in Section 3.2 is used to derive K = 50 objects for each video frame, where each object is associated with a BB, an object class label and a feature vector of dimensionality F = 2048. The size of the feature maps extracted from the RoI pooling layer of the Faster R-CNN is set to 14 × 14 (i.e. H = 14). Moreover, the feature extractor described in Section 3.2 (i.e. the pool5 layer of a ResNet-152 pretrained on ImageNet11K) is applied on the entire frame to derive a 2048-dimensional feature vector encoding the global appearance information. The extracted feature vectors are then utilized for learning the GCN, LSTM and FC layer parameters of our model, as explained in Section 3. We use a two-layer GCN with a hidden size of 2048 at each layer (i.e. M = 2, $F^{[m]} = 2048$, m = 1, 2), an LSTM layer of hidden size 4096, and two FC layers with 2048 and 239 units, respectively; a sigmoid nonlinearity is applied on the last FC layer to facilitate multilabel learning.
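The uniform keyframe sampling mentioned above can be sketched as follows (a minimal helper of our own, not the authors' code; it simply spreads Q indices evenly over the video's frames):

```python
import numpy as np

def uniform_frame_indices(num_frames, q=9):
    """Pick q keyframe indices spread uniformly over a video of num_frames frames."""
    return np.linspace(0, num_frames - 1, q).round().astype(int)

idx = uniform_frame_indices(900)  # e.g. a 30 s clip at 30 fps -> sparse sampling
```

For long videos the selected frames are minutes apart, which is why the LSTM, rather than short-range 3D convolutions, is relied upon to link them.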
A two-stage procedure is applied for training our model. Initially, the overall network is trained for 60 epochs using the Adam optimizer, batch size 64, cross-entropy (CE) loss, learning rate $10^{-4}$ reduced by a factor of 10 at epoch 50, and a dropout rate of 0.5 applied between the two FC layers. In the second stage, the GCN is frozen and utilized as a feature extractor for further optimizing the parameters of the LSTM and FC layers, using a learning rate of $10^{-5}$ and 10 epochs in total.

Results
The proposed approach is evaluated on FCVID using the mean average precision (mAP) and compared against the top-scoring approaches of the literature, i.e. PivotCorrNN [15], LiteEval [30], AdaFrame [31], SCSampler [17], ST-VLAD [22] and AR-Net [19]. On YLI-MED, the top-1 accuracy is utilized, and the comparison is performed against the top-scoring literature approaches for this dataset, i.e. C3D+LSVM [24], 3D-CNN [8], TSN [26], ActionVLAD [22] and S2L [32]. The results on FCVID and YLI-MED are shown in Tables 1 and 2, respectively. From the obtained results we observe the following: i) The proposed approach achieves the best performance on both datasets. On YLI-MED, we improve the state of the art by a large margin. On the much larger FCVID dataset, a small but consistent performance gain of 0.2% is obtained over the previous best method on this dataset, despite the latter using a very strong backbone network (EfficientNet); its variant that uses a ResNet backbone has a mAP that is lower by 3 percentage points. Comparing our method, which has only been tested with a ResNet backbone, with the equivalent AR-Net variant, a significant performance gain of 3.3% is observed. ii) From the results on YLI-MED (Table 2) we observe that the C3D approaches underperform in this task. This may be due to overfitting, as this dataset is relatively small [8], or because the C2D approaches, operating at the first level on individual frames and combined with the LSTM, can capture more effectively the loose spatiotemporal structure and dynamics of the high-level events [12].

Figure 3. The graph's adjacency matrix (Eq. (5)) for the frame of Fig. 2. We observe that certain object classes tend to produce highly influential graph nodes, indicating that they are very strong predictors of the recognized event.
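For reference, the mAP metric used on FCVID is the mean over classes of the average precision. A minimal numpy sketch of the standard definition (our own implementation, shown only to make the metric concrete):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: labels are 0/1 ground truth, scores are confidences."""
    order = np.argsort(scores)[::-1]               # rank by descending score
    hits = labels[order].astype(float)
    prec = np.cumsum(hits) / np.arange(1, len(hits) + 1)  # precision at each rank
    return (prec * hits).sum() / max(hits.sum(), 1)

def mean_ap(score_mat, label_mat):
    """Mean AP over classes (one column per class)."""
    return float(np.mean([average_precision(score_mat[:, c], label_mat[:, c])
                          for c in range(score_mat.shape[1])]))
```

Because each class is averaged independently, rare event classes weigh as much as frequent ones, which makes mAP a natural choice for the multilabel FCVID setting.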

Explanations of the event recognition results
In addition to the event label, our model can provide visual explanations concerning the event recognition outcome. This is performed by exploiting the WiDs of the graphs' adjacency matrices, as described in Section 3.6.
To illustrate how our model can be used to provide visual explanations at frame-level, Fig. 2 presents one frame of a video labeled "Wedding ceremony", while the graph's adjacency matrix (Eq. (5)) corresponding to this frame is depicted in Fig. 3. Moreover, in Fig. 4 the top bar plot presents the average DoC values derived using the OD, and similarly, the bottom bar plot shows the "average" WiDs (Eq. (9)) of the detected objects.

Figure 4. The average DoC values (top) and "average" WiDs (Eq. (9), bottom) corresponding to the objects detected in the frame of Fig. 2. We observe that objects detected with a high DoC value are mostly unrelated with the recognized event. On the other hand, the objects associated with a high WiD (couple, men, people, woman, etc.) strongly correlate with the event ("Wedding ceremony").
Using the WiDs, the detected objects can be ranked and used to produce visual explanations of the model's result. For instance, the three most and three least salient objects with relation to the recognized event, along with their "average" WiD values, are shown in Fig. 2 with green and red BBs, respectively.
From the example above, we see that our model tends to focus on the objects that are most visually relevant to the recognized event and to ignore the irrelevant ones. Additionally, in contrast to the DoC values, which can only provide a general overview of the scene, we observe that the WiDs contain valuable information about the event and can be used to accurately identify the visually salient objects.
In Fig. 5 we demonstrate the use of our model to provide explanations at video-level. Each row of this figure corresponds to a video from a different event category, consisting of a video frame, one bar plot depicting the ϑ = 10 objects with the highest "average" WiDs at video-level (Eqs. (10), (11)) and a similar bar plot depicting the objects with the highest average DoC values along the video. We again observe that in all examples the proposed method focuses on a usually small part of the frame where the recognized event is occurring. For instance, the objects depicting a skater (although mislabeled as "dog" by the OD) and a woman fishing (labeled as "woman" by the OD) are identified as the most salient in the frames sampled from the videos of the events "Person attempting a board trick" (second row of Fig. 5) and "Person landing a fish" (fourth row of Fig. 5), respectively. We also see that the top ten object class labels derived with our method can in all cases provide a sensible recounting of the recognized event, while this is not true for the DoC-based recountings.

Figure 6. Visual explanation example for a video depicting "Working on a woodworking project" but mis-recognized as "Person attempting a board trick". From the bar plot we see that the most salient objects based on the "average" WiDs at video-level (Eq. (10)) are "skate park" and "skatepark". These objects refer to the roof of the wooden construction, which, as shown in the second frame, highly resembles a skate park, explaining why our model mislabeled this video.
Finally, in Fig. 6 we provide an example where our model produced a wrong event recognition decision. From the bar plot in this figure we see that the most salient objects are "skate park" and "skatepark", both associated with very high "average" WiDs. We also see that the roof of the wooden construction depicted in the second video frame of Fig. 6 is very similar to a skate park, which explains why our model mislabeled this video.

Ablation study
We perform two ablation studies in order to gain further insight into the proposed approach. Firstly, we examine the influence of the main components of the proposed network architecture. More specifically, we evaluate the following three architectures: i) Global: the bottom-up mechanism, graph construction mechanism and GCN are removed from the network, i.e. only "global" features $\hat{z}^{(i,j)}$ (Eq. (7)) are utilized for learning the event. ii) Global + local + FC: the graph construction mechanism and GCN are replaced by an FC layer, i.e. the adjacency matrix $A$ (Eq. (5)) is removed from Eq. (6). iii) Global + local + GCN: the entire network architecture is utilized. For simplicity, all the above networks are evaluated on FCVID using just the first-stage training procedure described in Section 4.2 (without freezing the GCN; thus, there is a small performance loss compared to the results reported in Table 1), i.e., end-to-end training for 60 epochs with the Adam optimizer and CE loss, batch size 64, initial learning rate $10^{-4}$ reduced to $10^{-5}$ at epoch 50, and an FC dropout rate of 0.5. The results are provided in Table 3. We see that the information provided by the bottom-up mechanism and by the graph-related parts of the network (the graph construction mechanism and the GCN) has a strong impact on the performance of the network, providing an absolute gain of 2.3% and 1.3%, respectively. We also see that the graph construction mechanism and GCN cannot be sufficiently replaced by an FC layer, as the FC layer cannot model the relations among objects with the same effectiveness. In a second study, we examine the impact of the number of GCN layers on the performance of the model. The different architectures are evaluated on FCVID using the same training procedure as above. The results are shown in Table 4. We observe that the performance of the network slightly drops when more than two layers are used, due to the known problem of over-smoothing [4].

Conclusions
We presented a new approach for video event recognition which exploits the relations among objects within each frame. More specifically, a graph, constructed using the appearance features of the objects, is exploited by our model to recognize the video event. Moreover, using the weighted in-degrees of the graph's adjacency matrix, our model is able to provide insightful explanations for its decisions. It is experimentally verified that this approach achieves state-of-the-art results on two popular video datasets.