Crowd Violence Detection from Video Footage

Surveillance systems currently deploy a variety of devices that can capture visual content (such as CCTV, body-worn cameras, and smartphone cameras), rendering the monitoring of video footage obtained from multiple such devices a complex task. This becomes especially challenging when monitoring social events that involve large crowds, particularly when there is a risk of crowd violence. This paper presents and demonstrates a crowd violence detection system that can process, analyze, and alert potential stakeholders when violence-related content is identified in crowd-based video footage. Based on deep neural networks, the proposed end-to-end framework utilizes a 3D Convolutional Neural Network (CNN) for the (near) real-time analysis of video streams and video files for crowd violence detection. The framework is trained, evaluated, and demonstrated using the Violent Flows dataset, a crowd violence dataset that is widely used for research. The presented framework is provided as a standalone application for desktop environments and can analyze both video streams and video files.


I. INTRODUCTION
Surveillance systems are widely used for numerous purposes, including, but not limited to, crime prevention, monitoring, and evidence discovery. The main characteristic of surveillance systems is the need for operational personnel to continuously monitor the available video footage. Given that, nowadays, various devices that can capture visual content are deployed for such purposes, including Closed-Circuit TeleVision (CCTV), body-worn cameras, and smartphone cameras, monitoring the video footage obtained from multiple such devices has become a complex task that requires substantial human resources. Moreover, as this is a highly demanding task, human operators may occasionally miss events of interest.
Particularly challenging to monitor are social events (e.g., football matches and music festivals), which are becoming increasingly crowded and where there is often a risk of crowd violence erupting. An Artificial Intelligence (AI) based framework that could continuously monitor multiple sources of video footage and adequately inform security personnel, in (near) real-time, when activities and events of interest (such as crowd violence) are detected would be particularly advantageous, and efforts in this direction have attracted significant interest in recent years.
In this work, we showcase a beta version of the crowd violence detection framework introduced in [1], which reports higher accuracy than several state-of-the-art approaches [2]-[5] and outperforms the deep-learning-based approaches [6]-[8]. The proposed framework incorporates a 3D Convolutional Neural Network (3D-CNN) [6] architecture that can process video footage in (near) real-time. More specifically, the presented framework can analyze video files and video streams by processing mini-batches of 16 frames at every step. A demo is available at: https://m4d.iti.gr/crowd-analysis-tool/.

Fig. 1. Illustration of the crowd violence detection framework sub-components in a loop: first, the frames for further processing are sampled; next, their features are extracted; the predictor then estimates a violence level; and the loop ends with the visualization in the graphical user interface.
Regarding the real-time processing, the proposed framework consists of four sub-components that run in a loop (see Figure 1); these sub-components are:
• the frame sampler, which selects the frames for the next processing step;
• the feature extractor, which encodes the visual input of frames to visual features;
• the violence predictor, which analyzes the features and estimates a violence score; and
• the visualization interface, which visualizes the results and notifies the end-user.
This work is organized as follows. Section II describes the crowd violence detection framework and its sub-components. Section III demonstrates the framework while highlighting the information flow and the use cases. Section IV concludes by discussing future directions for improving the framework.
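The four-stage loop above can be sketched as follows. This is a minimal illustration only: the function names (`sample_frames`, `extract_features`, `predict_violence`, `visualize`) are hypothetical placeholders, not the framework's actual API, and the buffer size of 32 reflects the one-second window discussed in Section II.

```python
from collections import deque

def processing_loop(frame_source, sample_frames, extract_features,
                    predict_violence, visualize, batch_size=16):
    """Illustrative sketch of the four-stage detection loop (placeholder API)."""
    buffer = deque(maxlen=32)  # keep roughly the last second at ~30 fps
    for frame in frame_source:
        buffer.append(frame)
        if len(buffer) >= batch_size:
            batch = sample_frames(buffer, batch_size)   # 1) frame sampler
            features = extract_features(batch)          # 2) feature extractor
            score = predict_violence(features)          # 3) violence predictor
            visualize(frame, score)                     # 4) visualization interface
```

In a real deployment each stage would run concurrently so the predictor never blocks frame capture; the sequential version is shown here only to make the data flow explicit.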

II. CROWD VIOLENCE DETECTION FRAMEWORK
The proposed crowd violence detection framework executes a sequential loop in which four sub-components collaborate to detect violence-related scenes in video footage, namely video files and video streams. This section describes each of these sub-components and its role in processing such footage.

A. Frame sampler
The frame sampler is a sub-component that is enabled only when the analysis is performed on video streams; in the case of video files, all frames are processed. The overall aim of the frame sampler is to prepare sequences of 16 consecutive frames as input for the feature extractor. For video files, all of the video's frames are extracted using a non-overlapping sliding window of size 16 and step 16; for video streams, a queue of frames is used instead. The queue size is set to 32 in order to retain approximately the last second of the visual stream, assuming a frame rate of 30 frames per second (fps).
Every time a video frame is available from the stream, it is added to the queue. When the next sub-component in the loop requests a mini-batch for processing, two cases apply: if the queue is full, 16 of its 32 frames are sampled and forwarded; otherwise, the 16 most recent frames are forwarded sequentially. Following this approach, the 16 frames of the mini-batch are representative samples of approximately the last second of streamed content, assuming a frame rate of 30 fps. Figure 2 illustrates the frame sampler's functionality for video file and video stream processing. On the top, the sliding window approach is presented. On the bottom, the queue approach is illustrated for the two sampling modes: (i) when the queue is full (depicted in yellow), and (ii) sequential sampling (depicted in orange).
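The two sampling modes can be sketched as follows. This is an illustrative reconstruction from the description above, not the framework's actual implementation; in particular, the sampling strategy for the full-queue case (uniform random sampling of 16 of the 32 frames, kept in temporal order) is an assumption.

```python
import random
from collections import deque

def file_minibatches(frames, size=16):
    """Video files: non-overlapping sliding window of size 16, step 16."""
    for start in range(0, len(frames) - size + 1, size):
        yield frames[start:start + size]

class StreamSampler:
    """Video streams: keep the last 32 frames (~1 s at 30 fps) in a queue."""
    def __init__(self, queue_size=32):
        self.queue = deque(maxlen=queue_size)

    def push(self, frame):
        self.queue.append(frame)

    def next_minibatch(self, size=16):
        frames = list(self.queue)
        if len(frames) < size:
            return None                       # not enough frames buffered yet
        if len(frames) == self.queue.maxlen:  # full queue: sample 16 of 32
            idx = sorted(random.sample(range(len(frames)), size))
            return [frames[i] for i in idx]   # preserve temporal order
        return frames[-size:]                 # otherwise: 16 most recent frames
```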

B. Feature extractor
The feature extractor is the core sub-component of the proposed framework. In this step, the video frames are encoded into visual features that are used to predict the crowd violence level. To this end, a 3D-CNN ResNet [9] architecture trained on the Violent Flows [10] dataset is deployed. This dataset was selected because it contains a substantial number of crowd violence clips, in contrast to more recent violence-related datasets such as RWF-2000 [11]. For each mini-batch of 16 frames, the tool extracts a 2048-dimensional feature vector from the fully connected layer preceding the binary classifier. The next step then estimates the crowd violence level for each feature vector.
The extracted features were learned using the Violent Flows training set, consisting of 246 videos with an overall duration of 14.76 minutes, balanced between the two categories of crowd violence and non-violence. Following the recent literature, a five-fold cross-validation approach was adopted, splitting the dataset into a four-to-one ratio for training and testing, respectively. The training process was performed on an NVIDIA RTX 2080 Ti GPU with 11 GB of memory for 100 epochs and lasted approximately 4 hours.
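The five-fold split described above can be sketched as follows. This is an illustrative sketch of the standard procedure, assuming video-level splitting with a fixed shuffle seed; the paper does not specify its exact fold assignment.

```python
import random

def five_fold_splits(n_videos, seed=0):
    """Yield (train, test) index lists for 5-fold cross-validation.

    Each fold serves once as the test set, giving the roughly
    four-to-one train/test ratio used in the paper.
    """
    indices = list(range(n_videos))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]  # 5 near-equal folds
    for k in range(5):
        test = folds[k]
        train = [i for j in range(5) if j != k for i in folds[j]]
        yield train, test
```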

C. Violence predictor
The objective of the predictor is to estimate the crowd violence severity level for each mini-batch of 16 frames. This process could be integrated into the feature extractor; however, a real application requires (near) real-time processing, which is better served by running the prediction separately for each feature vector. The crowd violence level is estimated as a prediction score from 0.0 to 1.0 using a sigmoid function [12] as the activation function on the last node of the architecture. Values close to 0 denote non-violent scenes, while values close to 1 denote content related to crowd violence.
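The final scoring step amounts to passing the network's last-node output (logit) through a sigmoid, as sketched below; the 0.5 decision threshold is an illustrative assumption, not a value stated in the paper.

```python
import math

def sigmoid(x):
    """Standard logistic function: maps any real logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def violence_level(logit, threshold=0.5):
    """Return the violence score and an illustrative binary label."""
    score = sigmoid(logit)
    label = "violent" if score >= threshold else "non-violent"
    return score, label
```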

D. Visualization interface
Once the analysis of the first mini-batch of 16 frames is completed, the results are visualized in the user interface. For illustration purposes, an appropriate colour palette has been identified and incorporated, as depicted in Figure 3, in order to quantize the representation of the severity level.
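Quantizing the severity score into colour bands can be sketched as follows. The band boundaries and colour names below are illustrative assumptions; the paper's actual palette is the one shown in its Figure 3.

```python
def severity_colour(score):
    """Map a violence score in [0, 1] to a UI overlay colour band.

    Hypothetical four-band palette for illustration only.
    """
    if score < 0.25:
        return "green"   # non-violent content
    if score < 0.50:
        return "yellow"  # low severity
    if score < 0.75:
        return "orange"  # elevated severity
    return "red"         # crowd violence
```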

III. DEMONSTRATION
This section illustrates the system's user interface, while some implementation details are also presented.

A. User interface
The user interface developed for the crowd violence detection framework involves three basic functionalities: (i) the source definition, (ii) the selection of video footage, and (iii) the video player, as illustrated in Figure 4.
The source definition is depicted on top and allows the user to define the video files or video streams to be processed. The user can browse to a folder that contains video files for analysis, or type a URL that corresponds to a video stream so as to initiate its processing. The framework can process video streams transferred over the Real-Time Streaming Protocol (RTSP), the most common solution for such applications.
Once the source definition is completed, the user presses the "Analyze" button to initiate the analysis of the corresponding source. After the analysis is completed, the list of analyzed videos is presented below the source definition box in the case of video files. For video streams, the list remains empty, as the application does not support storing video streams. For both types of sources, the video player is responsible for playing the analyzed videos and for visualizing the results on the footage. Figure 5 shows a video of a crowd walking on a road during a celebration. The video is framed in green, indicating non-violent content, while the word 'Prediction' is displayed over the content, followed by the crowd violence level score, which in this case equals 0%. Figure 6 illustrates the analysis of a crowd participating in a riot in a stadium's stands after the end of a football game. The frame is colorized with a red bounding box, and the crowd violence level score is estimated at 97%.
As it is challenging to depict on paper how the proposed framework works in (near) real-time, we provide three samples that represent, in a temporal manner, the continuous monitoring performed by the framework. Figure 7 presents frames sampled every second from three videos. On the top, a crowd-violence-related sample clearly illustrates predicted crowd violence levels between 97% and 100%. In the middle, a non-violence-related example is displayed, with the framework accurately predicting a score of 0% for each of the four frames. Finally, on the bottom, a non-violence-related example is depicted that reports scores close to 0% for the four indicative frames.

B. Implementation details
For developing, training, evaluating, and demonstrating the proposed crowd violence detection framework, specific configurations and libraries are utilized to support its deployment. The operating system is Ubuntu 18.04 (Linux); the programming language is Python 3 [13], with additional libraries including, but not limited to, PyTorch and OpenCV (opencv-python), while the graphical user interface is built with the Python tkinter library.

IV. CONCLUSIONS
This work presented a crowd violence detection framework and the associated application. The framework is focused on crowd-centred scenes and aims to detect and estimate the violence level in (near) real-time while processing video files or video streams. In addition, the various components comprising the framework, as well as the user interface of the application, were presented. Future steps include improvements towards reducing the latency when video streams are processed and enabling support for other platforms, while the simultaneous processing of multiple cameras will also be investigated.
ACKNOWLEDGMENT
This work was supported by the Horizon 2020 projects CONNEXIONs (H2020-786731) and PREVISION (H2020-833115), funded by the European Commission.