Autonomous UAV Safety by Visual Human Crowd Detection Using Multi-Task Deep Neural Networks

Camera-equipped UAVs, or drones, are increasingly employed in a wide range of applications. Thus, ensuring their safe flight in areas containing people is a top priority. In this paper, a deep neural network-based method is proposed for the task of visual human crowd detection from UAV footage, allowing a drone to rapidly extract semantic segmentation maps from captured video frames during flight. These maps can be exploited (e.g., by a path planner) to define no-fly zones over or near human crowds and, hence, enhance UAV flight safety. To this end, a novel neural architecture for binary (crowd/non-crowd) semantic segmentation from single RGB images is proposed, based on Convolutional Neural Networks (CNNs). It consists of a semantic segmentation and an image-to-image translation (I2I) neural branch. The overall network is trained using a novel multi-task loss function that addresses both tasks by processing the output of the corresponding branch. During inference, information flows across branches through additional skip synapses to further assist the crowd detection task. In order to evaluate the proposed method, we introduce a real and a synthetic human crowd RGB image dataset. The proposed method outperforms previous aerial crowd detection methods by a large margin and without any post-processing. Moreover, it demonstrates increased generalization ability, while running at real-time and near-real-time speeds on a ground computer and on embedded AI hardware, respectively.


I. INTRODUCTION
Over the last few years, Unmanned Aerial Vehicles (UAVs) have been utilized in various applications such as surveillance [1], area mapping [2] or search and rescue operations [3]. In such scenarios, UAVs might be required to operate near groups of people, raising significant safety and legal issues due to possible malfunctions and/or regulations that forbid flight in the vicinity of human crowds. Relevant examples include infrastructure inspection in populated areas, or cinematography/media production applications [4]-[8], where it is typical to find crowds within the flight area (e.g., spectators of an outdoor sports event). Under such conditions, autonomous UAV operation requires special precautions.
Improved safety can be achieved by defining no-fly zones, in order to avoid operation near/over people. Human crowd detection on video frames captured from UAV cameras offers an effective solution, as safety can be ensured by visually recognizing crowded areas on-frame and, subsequently, actively avoiding them in 3D space (e.g., by back-projecting them onto the 3D area map [31] and correspondingly constraining the path planner). While a strict definition of human crowd is not commonly accepted, the national legislation of Germany prohibits UAV operation at a distance of less than 100 m from assemblages of more than 12 individuals, which is the crowd definition adopted in this work.
Human crowd detection entails detecting crowd and non-crowd regions on the 2D image/video frame. Previous methods approached the crowd detection problem either by applying a probabilistic model on extracted image features [9], [10], or by training a Fully Convolutional Network (FCN) [11] to classify video frame patches in two classes, crowd and non-crowd [12]-[14], [28]. Alternatively, Convolutional Neural Networks (CNNs) [16] were trained to perform crowd counting [20], [23]-[25] or directly regress crowd density maps [15], [17]-[19], from which human crowd regions may be obtained by applying image processing methods. Although these methods were able to predict heatmaps or density maps that indeed capture visible crowd regions, the region boundaries are not strictly delineated on the 2D video frame. Therefore, an extra post-processing step needs to be applied on the output heatmaps or density maps to obtain the final crowd and non-crowd regions. This extra post-processing step, which usually consists of simple image processing methods (e.g., thresholding/binarization, Gaussian blur, etc.), is not at all robust to distribution shifts between the training and the test set, while it adds undesirable computational complexity. The latter point is especially problematic in embedded systems with limited computational capabilities, as is typically the case with drones. In autonomous UAV flight, sluggish prediction of visible 2D crowd regions raises safety issues: when slow inference is combined with increased vehicle flight speed, crowd regions that need to be avoided may easily be missed.
To overcome the above issues, we propose transforming the crowd detection problem into a binary semantic image segmentation one, where each pixel of the input video frame is classified as belonging to either the crowd or the non-crowd class. Thus, more accurate (pixel-level) crowd region boundaries can be obtained, while a post-processing step is no longer necessary. Following this direction, the proposed method introduces a novel CNN architecture for rapid crowd segmentation in single RGB images. It utilizes a real-time semantic image segmentation CNN as the main neural branch and an Image-to-Image Translation (I2I) [22] network as the auxiliary branch to aid the main branch in the crowd segmentation task. This is accomplished through skip synapses that are added between them, in order to allow information flow from the I2I branch to the segmentation branch, thus providing extra context for crowd detection. The overall network is trained using a novel multi-task objective function that involves both semantic segmentation and I2I. Finally, we also introduce two human crowd segmentation image datasets, DroneCrowd and AirSimCrowd, which consist of real and synthetic UAV crowd images, respectively, along with their annotated segmentation maps. The proposed method was evaluated on both datasets, outperforming previous visual crowd detection methods while being significantly faster. Note that the proposed method is a generic, vision-based one requiring only an RGB camera; thus, it is directly applicable to any camera-equipped UAV.
In summary, the contributions of this paper are threefold:
• A novel composite CNN architecture for human crowd detection is introduced, combining in parallel two neural building blocks (a semantic segmentation and an I2I branch) that utilize a common feature extraction backbone and additional skip synapses between them.
• A novel multi-task loss function is employed for training the proposed architecture.
• Two new human crowd segmentation image datasets are introduced for evaluating the proposed method.

II. CROWD SEGMENTATION
In this work, crowd detection is approached as a semantic image segmentation problem, where each pixel of the input UAV video frame is assigned a per-class probability for each of the two object classes (crowd/non-crowd). Thus, for an input resolution of M × N pixels, the output is an M × N × 2 crowd segmentation map. With this goal in mind, a novel deep CNN architecture for crowd segmentation is proposed, which combines a semantic image segmentation network with an I2I network to accurately predict crowd segmentation maps. The I2I neural branch is used to provide extra semantic information to the segmentation neural branch through skip synapses that connect the two branches, further assisting the crowd segmentation task. The two networks share a single backbone/feature extraction CNN and are jointly trained using a multi-task objective function.
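To make the output format concrete, the following minimal sketch turns an M × N × 2 map of per-class probabilities into a binary crowd mask by picking the more probable class at each pixel. The channel order (index 0 for non-crowd, index 1 for crowd) and the function name `crowd_mask` are assumptions made for illustration; plain Python lists are used only to keep the example dependency-free.

```python
from typing import List

# M x N x 2 map: one [p_non_crowd, p_crowd] pair per pixel.
# The channel order is an assumption made for this sketch.
ProbMap = List[List[List[float]]]


def crowd_mask(prob_map: ProbMap, crowd_channel: int = 1) -> List[List[int]]:
    """Return an M x N binary mask: 1 where the crowd class is strictly more probable."""
    other = 1 - crowd_channel
    return [
        [1 if pixel[crowd_channel] > pixel[other] else 0 for pixel in row]
        for row in prob_map
    ]


# Toy 2 x 3 "frame".
probs = [
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
    [[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]],
]
mask = crowd_mask(probs)
```

Because the class decision is taken per pixel, no separate thresholding or smoothing stage is needed to obtain region boundaries.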

A. Semantic Image Segmentation Branch
Given an input image/video frame x, semantic image segmentation assigns object class probabilities to each input pixel. Since human crowd positions in the 3D world might change dynamically during UAV flight, regular and frequent semantic video feed analysis is a necessity. Thus, BiSeNet [21] was employed as the baseline semantic segmentation neural branch, due to its real-time processing capabilities, and thus is briefly described below. Note, however, that any fast CNN for semantic image segmentation could be utilized in its place.
BiSeNet adopts a two-column network architecture consisting of two neural streams, namely, the Spatial Path and the Context Path. The Spatial Path is composed of a shallow CNN in order to learn high-resolution features that encode spatial information. In contrast, pre-trained state-of-the-art CNN architectures are utilized in the Context Path to encode high-level semantic context information. Moreover, the features of each stage of the Context Path are refined using an Attention Refinement Module (ARM) to guide the learning process. As features from the Spatial and the Context Path encode different information, a Feature Fusion Module (FFM) was also utilized to effectively fuse the learned features. The final segmentation map is then obtained by upsampling the combined feature map to the output resolution. The loss function employed for training is the following one:

$$L_{seg} = L_p + \alpha \sum_{i} L_{a_i}, \tag{1}$$

where $L_p$ is the principal loss used to supervise the whole network and $L_{a_i}$ is an auxiliary loss for stage $i$ of the Context Path. $\alpha$ is used to weight the contribution of the auxiliary losses in the total loss. Both $L_p$ and $L_{a_i}$ are standard Softmax loss functions. Finally, note that for an input video frame resolution of M × N pixels, the output of the semantic segmentation branch, as well as the corresponding ground truth during training, is an M × N × 2 tensor.
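The weighted sum of a principal loss and auxiliary losses described above can be sketched as follows. This is an illustrative toy, not the training code: the helper names, the flattened per-pixel layout, and the default `alpha=1.0` are assumptions, and the auxiliary outputs are assumed to have already been resized to the ground-truth resolution.

```python
import math
from typing import Sequence


def softmax_loss(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean softmax cross-entropy; logits[k] holds the two class scores of pixel k."""
    total = 0.0
    for scores, label in zip(logits, labels):
        peak = max(scores)  # subtract the max score for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
    return total / len(labels)


def segmentation_loss(principal_logits, auxiliary_logits, labels, alpha=1.0) -> float:
    """Principal loss plus alpha times the sum of the auxiliary losses."""
    l_p = softmax_loss(principal_logits, labels)
    l_aux = sum(softmax_loss(aux, labels) for aux in auxiliary_logits)
    return l_p + alpha * l_aux


# Three pixels, two auxiliary heads, all scores equal (an uninformed network).
labels = [0, 1, 1]
uniform = [[0.0, 0.0]] * 3
loss = segmentation_loss(uniform, [uniform, uniform], labels, alpha=1.0)
```

With equal scores for both classes, every term equals log 2, so the toy value is three times log 2; a trained network would drive each term toward zero.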

B. Image-to-Image Translation Branch
Given paired training samples $\{x_i, y_i\}$, $i = 1, \dots, N$, where $x_i \in X$ are images belonging to a source domain $X$ and $y_i \in Y$ are images belonging to a target domain $Y$, I2I methods [22] aim to learn a mapping $G : X \to Y$. $G$ is typically represented by an encoder-decoder CNN architecture, trained under the conditional Generative Adversarial Network (GAN) [27], [38] framework. Conditional GANs consist of two competing networks, the generator and the discriminator. Given samples originating from the source domain, the generator aims to produce outputs that are similar to target domain samples and cannot be distinguished by the discriminator, which is adversarially trained to detect the generator's "fake" outputs.
In the proposed method, I2I is employed as an auxiliary task to aid semantic segmentation; the underlying intuition is that adversarial learning can complement typical supervised learning. Thus, during training, RGB images of resolution M × N containing crowds serve as source domain data, while their corresponding ground-truth RGB segmentation images (tensors of size M × N × 3, derived by trivially processing the corresponding segmentation maps) are utilized as target domain data. RGB segmentation images are simply an alternative representation of the segmentation map ground truth, one necessary for training the I2I neural branch of the proposed architecture. This branch corresponds to $G$, serving as the generator network whose objective is to learn the underlying mapping from real crowd images (source domain) to RGB segmentation images (target domain), while the objective of the employed discriminator $D$ is to distinguish samples produced by $G$ from ground-truth RGB segmentation images. As in typical conditional GANs, both $G$ and $D$ are trained in a supervised manner via the minimax game $\min_G \max_D L_{cGAN}(G, D)$, where the objective function $L_{cGAN}(G, D)$ is given by [22]:

$$L_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right]. \tag{2}$$

In alignment with previous methods [22], [32], we also train $G$ not only to fool $D$, but also to generate RGB segmentation images that are "close" to the corresponding ground-truth target domain images. As in [22], we utilize the L1 distance in the employed similarity loss function:

$$L_s(G) = \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_1\right]. \tag{3}$$

Apart from $L_{cGAN}(G, D)$ and $L_s(G)$, which are typically used for training I2I networks [22], we also train the generator $G$ to predict regular crowd segmentation maps, in order to prevent the backbone network from losing focus on our main task (crowd semantic segmentation).
Therefore, the final objective function used to train the I2I neural branch of the proposed network, Eq. (4), complements $L_{cGAN}(G, D)$ and $L_s(G)$ with $L_a(G)$, an auxiliary semantic segmentation loss function for the penultimate convolutional layer of the decoder part of $G$, which is similar to the ones ($L_p$, $L_{a_i}$) used in Eq. (1).
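As a rough numerical illustration of the two standard I2I terms discussed above (the conditional adversarial objective and the L1 similarity loss of [22]), the sketch below evaluates them on toy values. The function name, the flattened inputs, and the small `eps` guard are assumptions made for this example; the terms are returned separately, since the way they are weighted in the final objective is not reproduced here.

```python
import math
from typing import Dict, Sequence


def i2i_loss_terms(
    d_real: Sequence[float],     # D's probabilities for (source, ground-truth target) pairs
    d_fake: Sequence[float],     # D's probabilities for (source, G(source)) pairs
    generated: Sequence[float],  # flattened generator output, values in [0, 1]
    target: Sequence[float],     # flattened ground-truth RGB segmentation image
) -> Dict[str, float]:
    """Return the cGAN objective value and the mean absolute (L1) error."""
    eps = 1e-12  # guards against log(0); an implementation detail of this sketch
    # D is trained to maximise this value, while G is trained to minimise it.
    cgan = (
        sum(math.log(p + eps) for p in d_real) / len(d_real)
        + sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    )
    # Mean absolute difference, i.e. the L1 distance up to a constant scale.
    l1 = sum(abs(g - t) for g, t in zip(generated, target)) / len(target)
    return {"cgan": cgan, "l1": l1}


# A discriminator that cannot tell real from generated outputs assigns 0.5 to both.
terms = i2i_loss_terms([0.5], [0.5], [0.0, 1.0, 0.5], [0.0, 1.0, 1.0])
```

In this toy case the adversarial value is 2·log(0.5), the equilibrium value of the minimax game, and the L1 term reflects the single mismatched pixel value.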

C. Combining Semantic Segmentation with Image-to-Image Translation
The proposed method combines the semantic image segmentation branch and the I2I branch in a novel, unified network architecture for crowd segmentation, which is illustrated in Fig. 1. The employed semantic image segmentation neural branch consists of the Context Path and the Spatial Path, while the I2I neural branch consists of the generator G followed by the discriminator D. The backbone network (ResNet-18 [26]) of the Context Path is shared between the two branches, serving also as the encoder of the generator G. The decoding network of G is a CNN that uses both convolutional and deconvolutional layers, and D is a standard PatchGAN classifier, similar to the one used in [22]. Moreover, in order to allow information flow from the I2I branch to the semantic segmentation branch and thus enrich the extracted semantic features, skip synapses were added between neurons of the two intermediate stages of the decoder of G (I2I branch) and the segmenter's Context Path, conjoining the two branches. Importantly, as the final crowd segmentation maps are obtained from the main segmentation branch, the discriminator D and the two last convolutional layers of the generator G are necessary only during training; thus, they are discarded at deployment time to avoid extra computational cost during inference.
The overall network is trained using the proposed multi-task loss function, Eq. (5), which combines the semantic image segmentation and image-to-image translation objectives, with λ being a hyper-parameter used to adjust focus between the two tasks. The advantages of the proposed method are threefold. First, the proposed multi-task loss function assists the main crowd segmentation task, by effectively complementing typical supervised learning with adversarial learning. Since it is well-known that GANs are inherently resistant to overfitting [38], enhancing supervised training with an adversarial objective seems to have a regularizing effect. Second, the backbone CNN (ResNet-18) serves both as the encoder of the I2I branch and as the feature extractor of the crowd segmentation branch, thus saving significant computational cost and introducing additional regularization. Finally, the proposed parallel network architecture facilitates multi-task learning and allows the auxiliary I2I neural branch to further assist the main crowd segmentor through skip synapses.

III. EMPIRICAL EVALUATION
This Section provides a detailed description of the two human crowd image datasets we are introducing, along with the metrics used to evaluate the proposed crowd detection method. In addition, quantitative and qualitative performance evaluations of the proposed method are presented.

A. Datasets and Metrics
Although there are existing aerial-footage datasets for crowd counting (e.g., VisDrone [37]), they cannot be used for dense crowd detection since people appear scattered across the image, thus not forming a crowd. Therefore, two suitable datasets were created and annotated to evaluate the proposed method: DroneCrowd and AirSimCrowd¹.
The DroneCrowd dataset consists of RGB images depicting human crowds in a wide range of scenes (urban, countryside, day, night), captured at varying altitudes (from low to very high) and with varying crowd density (from tens to thousands of people). In order to induce this diversity in the dataset, we included images from the Crowd-Drone dataset [13], the dataset used in [29], as well as newly captured relevant aerial images². For the latter, five separate UAV flight missions in total were performed over two different terrains using a custom drone equipped with an RGB camera mounted on a gimbal. In total 1700 images were manually annotated with their ground-truth segmentation maps using annotation software [36], resulting in a very diverse and challenging human crowd detection dataset. The image resolution varies from 480 × 360 to 1920 × 1080 pixels, rendering the dataset even more challenging. From these images, 1199 are used for training and 591 for testing. The train set consists of the train images from the Crowd-Drone dataset, Sequence3, Sequence8, Sequence9, Sequence10, Sequence11 and Sequence16 from [29], and images captured during three of the five performed missions, including both terrains. In a complementary manner, the test set includes images from the Crowd-Drone test set and images captured during the remaining two UAV flight missions, ensuring that the train and test sets are mutually exclusive. Example samples from the DroneCrowd dataset can be seen in rows 1 and 2 of Fig. 2.

¹ DroneCrowd and AirSimCrowd datasets are available at https://aiia.csd.auth.gr/open-multidrone-datasets.
² The MULTIDRONE project experimental media productions are the corresponding source.
The AirSimCrowd dataset is a synthetic crowd detection dataset obtained from the UAV simulation software AirSim [30]. AirSim is a photorealistic UAV simulation environment, built on top of the Unreal Engine 4 (UE4) real-time 3D graphics/physics engine, which allows programmatic interaction with the simulated UAVs via Remote Procedure Call (RPC)-based communication and offers tools for RGB image and ground-truth annotation data extraction. In order to create the AirSimCrowd dataset, we simulated two scenarios. The first is a cycling scenario in a mountainous environment, where a cyclist is set to traverse a pre-defined route with crowds gathered at random locations along the route. During the simulation, a UAV was set to follow the cyclist at a relatively constant speed while recording video with an RGB camera pointing at the cyclist. The second scenario involved a UAV following a pre-defined, random trajectory near crowds placed at random locations in the scene. The scene used in this second scenario is different from the one used in the cycling scenario, in order to induce diversity in the AirSimCrowd dataset samples. Overall, 602 RGB images at a resolution of 640 × 360 pixels, along with their corresponding ground-truth segmentation maps, were obtained from both simulated scenarios. Example video frames of the footage captured by the simulated UAV in both scenarios are depicted in rows 3 and 4 of Fig. 2. Note that all images in the AirSimCrowd dataset are used only for testing purposes and not for training.
The crowd detection performance of all methods was evaluated using the commonly adopted Intersection-over-Union (IoU) metric:

$$IoU = \frac{TP}{TP + FP + FN},$$

where $TP$, $FP$ and $FN$ are the number of true positives, false positives and false negatives at pixel level, respectively. In addition, inference speed is measured both in ms and FPS.
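The pixel-level IoU described above can be computed per class as in the following minimal sketch. The function name `iou`, the label convention (1 for crowd, 0 for non-crowd), and the value returned when a class is absent from both maps are assumptions made for illustration.

```python
from typing import List


def iou(pred: List[List[int]], truth: List[List[int]], cls: int = 1) -> float:
    """Pixel-level IoU = TP / (TP + FP + FN) for the class `cls`."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == cls and t == cls:
                tp += 1  # predicted as cls and labelled cls
            elif p == cls:
                fp += 1  # predicted as cls but labelled otherwise
            elif t == cls:
                fn += 1  # labelled cls but predicted otherwise
    denom = tp + fp + fn
    # Convention (assumption): a class absent from both maps counts as a perfect match.
    return tp / denom if denom else 1.0


pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
crowd_iou = iou(pred, truth, cls=1)      # 2 TP, 1 FP, 1 FN
non_crowd_iou = iou(pred, truth, cls=0)  # 2 TP, 1 FP, 1 FN
```

Evaluating the function once per class yields the separate crowd and non-crowd scores that the evaluation reports.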

B. Evaluation procedure
In all experimental sessions, all neural models were trained using the DroneCrowd train set. The proposed crowd detection method is compared to the baseline methods of [13], [15], [21]. The model of [13] consists of a simple FCN. Following [13], it was trained as a binary classifier on extracted 128 × 128 pixel image patches depicting both crowd and non-crowd regions. During inference, the trained model predicts crowd heatmaps by assigning crowd probabilities to each 128 × 128 pixel patch of the test image. Two variants of the method are used: a) the vanilla version from [13], denoted by FCN_t, where the 2D crowd regions are obtained by thresholding the predicted crowd heatmap, resulting in a binary crowd map, and b) a variant slightly improved by us and denoted by FCN_p, containing a final post-processing step to further refine the detected crowd regions. The post-processing step consists of convolving the obtained binary crowd map with a Gaussian kernel, in order to fill potential gaps in the binary maps. Moreover, a state-of-the-art crowd analysis network using a simple encoder-decoder architecture with dilated convolutions [33], i.e., CSRNet [15], was also adapted to our case and trained to predict grayscale segmentation images instead of crowd density maps (since we only care about detecting crowds, not counting them). During testing, similarly to FCN_t, the final 2D crowd regions are obtained by simply thresholding the predicted output maps.
Finally, the proposed method is compared against the semantic segmentation network BiSeNet [21] with a ResNet-18 as backbone, which was trained to directly predict crowd segmentation maps. Note that in all experiments, the best performing post-processing hyperparameters (threshold value and Gaussian kernel size) were selected for FCN_t, CSRNet and FCN_p.
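The baseline post-processing described above (thresholding a heatmap, then convolving the binary map with a Gaussian kernel to fill gaps) can be sketched as follows. All names and default values (`t`, `size`, `sigma`, `t_fill`) are hypothetical, and the final re-thresholding of the smoothed map is an assumption of this sketch rather than a documented detail of the baselines. The kernel size is expected to be odd.

```python
import math
from typing import List

Grid = List[List[float]]


def threshold(heatmap: Grid, t: float) -> Grid:
    """Binarize a crowd heatmap: 1.0 where the value reaches the threshold t."""
    return [[1.0 if v >= t else 0.0 for v in row] for row in heatmap]


def gaussian_kernel(size: int, sigma: float) -> Grid:
    """Normalized size x size Gaussian kernel (size is expected to be odd)."""
    half = size // 2
    raw = [
        [math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
         for dx in range(-half, half + 1)]
        for dy in range(-half, half + 1)
    ]
    norm = sum(sum(row) for row in raw)
    return [[v / norm for v in row] for row in raw]


def blur(grid: Grid, kernel: Grid) -> Grid:
    """2D convolution with zero padding, so border pixels see fewer neighbours."""
    h, w, half = len(grid), len(grid[0]), len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ky, kernel_row in enumerate(kernel):
                for kx, weight in enumerate(kernel_row):
                    yy, xx = y + ky - half, x + kx - half
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += weight * grid[yy][xx]
    return out


def postprocess(heatmap: Grid, t=0.5, size=3, sigma=1.0, t_fill=0.5) -> List[List[int]]:
    """Threshold, smooth, then re-threshold (the last step is an assumption)."""
    smoothed = blur(threshold(heatmap, t), gaussian_kernel(size, sigma))
    return [[1 if v >= t_fill else 0 for v in row] for row in smoothed]


# A crowd region with a one-pixel gap in the middle.
heatmap = [[0.9, 0.9, 0.9], [0.9, 0.1, 0.9], [0.9, 0.9, 0.9]]
binary = threshold(heatmap, 0.5)
filled = postprocess(heatmap)
```

The example shows the intended effect on the isolated gap, and also why such a stage needs tuning: both thresholds and the kernel size change which pixels end up inside the crowd region.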
The proposed network was simultaneously trained for both crowd segmentation and image-to-image translation tasks using Eq. (5), for up to 200 epochs. The Adam [34] optimizer was used with batch size 16 and initial learning rate 0.001, which is reduced in each epoch using the "poly" learning rate strategy [21]. Similar to BiSeNet and CSRNet, our backbone network (ResNet-18) is pretrained on ImageNet [35], while λ in Eq. (5) was empirically set to 0.7, a value most beneficial for crowd detection. Moreover, the train set images were augmented online during training using random scaling, cropping and horizontal flipping.
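For reference, the "poly" schedule multiplies the initial learning rate by (1 − step / max_steps) raised to a fixed power. The sketch below applies it per epoch with the initial rate and epoch budget stated above; the exponent of 0.9 is a commonly used value and an assumption here, as is the function name.

```python
def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """'Poly' decay: base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power


# Learning rate at the start of each of the 200 epochs, starting from 0.001.
schedule = [poly_lr(0.001, epoch, 200) for epoch in range(200)]
```

The rate decreases monotonically from the initial value toward zero at the end of the epoch budget, with most of the decay concentrated in the final epochs.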
Experiments were performed for several input resolutions to demonstrate the performance-speed ratio offered by all competing methods. Typically, higher input resolutions facilitate crowd detection as people are more distinguishable in the image. However, in these cases, inference speed can be considerably decreased, especially when embedded hardware is used.
Detailed inference speed comparisons were made across all competing models (Proposed, FCN_t, FCN_p, CSRNet and BiSeNet), due to the crucial importance of fast (ideally real-time) execution in robotics applications. Notably, if human crowd areas are identified at a slow rate when the UAV is flying at a high speed, regions that need to be avoided may easily be missed, thus raising important safety concerns. Results in terms of inference speed (in ms) and FPS are presented in Table I. We also report the crowd heatmap prediction speed of [13] without either thresholding or Gaussian blur (FCN), to evaluate purely the network's forward pass speed. However, in this case, accurately delineated 2D crowd regions cannot be obtained. Moreover, experiments were conducted using both an Nvidia GTX 1080 Ti GPU and an Nvidia Jetson Xavier embedded AI computing board, covering the cases where the crowd detection model runs on a ground computer or on-board the UAV hardware, respectively. Different input image sizes of 640 × 360, 1280 × 720 and 1920 × 1080 pixels (M × N) were used to test inference speed at low, medium and high input resolution, respectively. The results indicate that the proposed method runs significantly faster than FCN, FCN_t, FCN_p and CSRNet, even achieving double the speed, or faster, for the highest resolution. When compared to the BiSeNet baseline, the proposed network architecture is slower only by 7.5 FPS in the worst-case embedded-execution scenario (640 × 360 input resolution on an Nvidia Jetson Xavier). This is not critical, as running speed remains real-time.
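The safety argument above is simple arithmetic: the distance a UAV covers between two consecutive predictions equals its speed divided by the inference rate. The sketch below makes this explicit; the 10 m/s flight speed and the two frame rates are hypothetical values chosen only for illustration, not measurements from Table I.

```python
def latency_ms(fps: float) -> float:
    """Per-frame inference time in milliseconds for a given frame rate."""
    return 1000.0 / fps


def metres_per_prediction(speed_m_s: float, fps: float) -> float:
    """Distance flown between two consecutive crowd predictions."""
    return speed_m_s / fps


# Hypothetical flight speed of 10 m/s with a fast and a slow detector.
fast = metres_per_prediction(10.0, 20.0)  # 20 FPS
slow = metres_per_prediction(10.0, 2.0)   # 2 FPS
```

At the same flight speed, a detector that is ten times slower leaves a ten times longer stretch of the flight path unobserved between updates.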
In order to evaluate the crowd detection performance of the proposed and all competing methods in real-world aerial images, the DroneCrowd test set was used. The IoU scores of all models for both the crowd and non-crowd classes are reported in Table II. Results are reported at 640 × 360 and 1280 × 720 input resolution (by training and testing all models accordingly), where the running speed of the proposed method is over 20 FPS on an Nvidia Jetson Xavier, simulating a real-world autonomous UAV flight scenario. As shown in Table II, the proposed method significantly outperforms FCN_t and FCN_p, improving crowd class IoU by up to 36% and 24% at low and medium input resolution, respectively. Moreover, the proposed method increased crowd detection performance, when compared to the CSRNet and BiSeNet baselines, by a margin of up to 7% and 5%, respectively. Apart from increased detection performance for the crowd class, the proposed method was able to detect non-crowd regions more accurately too. This is also important for autonomous UAV flight, as unnecessary actions or dangerous maneuvers can be avoided.
Fig. 3. Crowd detection results of the proposed method for real test crowd images from DroneCrowd dataset (rows 1 and 2) and unseen synthetic crowd images from AirSimCrowd dataset (rows 3 and 4). Each row depicts a triplet of corresponding input, ground-truth segmentation and predicted segmentation images.
In real-world applications, however, the UAVs might operate in scenes that differ greatly from the ones depicted in the training dataset, rendering generalization a necessary crowd detection model feature. In order to evaluate the generalization ability of all competing methods, the corresponding models that were trained on the DroneCrowd train set were tested on the AirSimCrowd set images. The IoU results of all models for both classes (crowd, non-crowd) at low and medium input resolution are presented in Table III. The proposed method demonstrates the highest generalization ability, increasing the crowd detection performance by up to 34%, 12% and 7%, when compared to the models of [13] (FCN_t, FCN_p), BiSeNet and CSRNet, respectively.
As shown in Tables I-III, the proposed method manages to outperform all competing methods both in crowd detection accuracy and in generalization, without sacrificing execution speed. This is due to its parallel network architecture, which simultaneously saves computational cost and allows multi-task training and information exchange between the two neural branches, resulting in richer feature maps for crowd detection when compared to typical network architectures (FCN, CSRNet, BiSeNet).
Apart from the crowd detection performance reported in Tables I-III, a qualitative evaluation of the proposed model was also performed. The crowd detection results of the proposed method can be seen in Fig. 3, where random DroneCrowd and AirSimCrowd test images are depicted along with their corresponding ground-truth and predicted 2D crowd regions. The proposed method accurately predicts human crowds in 2D pixel space, with negligible false negatives and false positives, both for real and synthetic test images. In addition, the crowd detection results of the proposed method on a previously unseen, real UAV-captured video can be found at the following URL: http://bit.ly/crowd_det_results. An example video frame is depicted in Fig. 4. The proposed crowd detection method successfully detects crowd regions, even though its input is a completely unknown scene.

IV. CONCLUSION
In this paper, a deep neural network-based human crowd detection method from UAV video feed was presented. It is based on transforming the crowd detection problem into a semantic segmentation one. To this end, a novel neural architecture is proposed that combines a CNN-based real-time semantic image segmentation network branch with an image-to-image translation neural branch. The two neural pathways share the same feature extraction CNN and are jointly trained using a novel multi-task loss function that considers both tasks. Additionally, skip synapses were added between neurons of the two branches, allowing semantic information to flow from the I2I branch to the segmentation branch during inference. The proposed crowd detection method was evaluated using two newly introduced aerial crowd detection image datasets, DroneCrowd and AirSimCrowd. The proposed method significantly outperformed all competing methods, while running at real-time and near-real-time speeds on a ground computer and on an embedded AI system, respectively.