Accelerating 3D scene analysis for autonomous driving on embedded AI computing platforms

Abstract—The design of 3D object detection schemes that use point clouds as input in automotive applications has gained a lot of interest recently. Those schemes capitalize on Deep Neural Networks (DNNs) that have demonstrated impressive results in analyzing complex scenes. The proposed schemes are generally designed to maximize detection performance, at the cost, however, of high computational complexity. To mitigate this complexity and to facilitate deployment on edge devices, model compression and acceleration techniques can be utilized. In this paper, we propose compressed versions of two well-known 3D object detectors, namely, PointPillars and PV-RCNN, utilizing dictionary learning-based weight-sharing techniques. It is demonstrated that significant acceleration gains can be achieved with acceptable average precision loss when evaluated on the KITTI 3D object detection benchmark. These findings constitute a concrete step towards the deployment of high-performance networks on edge devices of limited resources, such as NVIDIA's Jetson TX2.


I. INTRODUCTION
The continuously growing domain of Autonomous Vehicles (AVs) has gained interest in both academia and industry. AVs are considered an integral component of connected intelligent transportation systems, improving the performance and safety indicators of future mobility systems by providing safer transportation, efficient management of fuel consumption and an improved overall travelling experience.
One of the most essential operations executed on AVs, enabling the aforementioned benefits, is the perception and understanding of dynamic and complex environments from sensor data coming from various modalities (e.g., camera, LiDAR). Camera-based scene analysis modules are sensitive to challenging illumination or weather conditions, which can significantly degrade the quality of the imagery data. On the other hand, LiDAR data are less affected by environmental changes and provide depth information directly, although their distinctly sparse representation brings new challenges that need to be addressed. Traditionally, a LiDAR analysis module processes the generated point clouds for detecting objects via several operations, including background/road subtraction followed by spatiotemporal clustering and classification [1].

This work has received funding from the H2020-ICT-2019-2 project CPSoSaware, "Cross-layer cognitive optimization tools & methods for the lifecycle support of dependable CPSoS" (Grant Agreement No. 873718).
The advances in deep learning are considered the main driving force towards fast and effective scene understanding solutions. Many recent works have studied the strengths and weaknesses of Deep Neural Networks (DNNs) in detecting objects in point clouds generated by LiDAR sensors [2]–[4]. While the impressive performance of DNNs in LiDAR-based object detection is nowadays well established, their high storage and computational costs become problematic, especially in real-time applications like the ones related to scene understanding in AVs.
In this work, we study the application of recently proposed Model Compression and Acceleration (MCA) techniques on state-of-the-art DNNs designed for LiDAR-based 3D object detection in AVs. Our comprehensive evaluation on the KITTI 3D object detection benchmark [5] involving the detection of cars, pedestrians and cyclists, demonstrates very promising results towards the goal of efficient and accurate scene understanding. Apart from a high-performance GPU-based platform, the 3D detectors were also deployed on NVIDIA's Jetson TX2, thus, revealing a promising future direction concerning the efficient design and execution of compressed models on edge devices, utilizing dictionary learning-based MCA approaches.
The rest of the paper is organized as follows. In Section II, the related bibliography for 3D object detection and MCA techniques is presented, along with the main contributions of the paper. Section III contains a brief description of the DNNs under study and the used weight-sharing MCA techniques. Our experimental evaluation is described in detail in Section IV. Finally, Section V concludes the paper.

II. RELATED WORK AND CONTRIBUTION
A. Object detection in LiDAR point clouds

3D object detection from LiDAR point clouds is mainly a data-driven task due to the lack of apparent structure in the data. Deep fully convolutional networks have been employed in the literature since 2016, with the main evolutionary elements concerning i) the transformation of the 3D point cloud, ii) the network structure and iii) the utilization of feedback loops or abstraction layers for multi-scale feature extraction. Initial attempts [6] projected the 3D points onto a 2D plane and used traditional 2D fully convolutional networks, reporting an accuracy of 71.0% for moderate-difficulty cars. The use of 3D fully convolutional networks [7] increased the accuracy to 75.3%, but also increased the computational complexity. Yan et al. [8] proposed sparse convolutional networks, improving training and inference times and reaching a reported accuracy of 79.46% for moderate-difficulty cars, according to the KITTI benchmarks. PointPillars [9] proposed a novel encoder that utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars), presenting an accuracy of 74.31% in the same category. 3D object detection with region proposals was introduced in PointRCNN by Shi et al. [10]. They used two stages, where the first stage generates 3D proposals and the second refines them, reporting an accuracy of 75.64%. An extension of PointRCNN is the Part-A² Net [11], encompassing a part-aware proposal part and an aggregation part, reporting an accuracy of 78.49%. PV-RCNN [12] is another novel approach that combines 3D sparse convolutions in a voxelized space with region proposals and an abstraction layer, reporting an average precision of 81.43%.

B. Model compression and acceleration
Model Compression and Acceleration (MCA) refers to a family of approaches for producing deep models that are efficient in terms of the resources (computational, storage) required during the inference phase. This is achieved by transforming an original, pre-trained deep model of high complexity into a new, reduced version without substantially affecting the final performance, depending, of course, on the application. The MCA literature is thoroughly described in several recent survey papers like [13], [14] and [15], where the presented techniques can be utilized at the algorithmic or the hardware level, while, currently, there are also methods that take into account both algorithmic and hardware aspects. Many of the proposed techniques focus on pruning unimportant parts of deep networks, such as filters [16]. Other approaches limit the representation of the involved parameters by reducing their bit-width or by increasing common representations via scalar, vector and product quantization [17], [18]. In another direction, the involved quantities are modelled mathematically as tensors or matrices and decomposed into factors by exploiting inherent properties such as low-rankness [19]. Focusing on the utilization of MCA techniques in 3D object detection using point clouds, the literature is quite limited. An example can be found in [20], where the proposed deep model is transformed via a pruning technique that sets unimportant parameters to zero by exploiting the alternating direction method of multipliers [21].

C. Contribution
The efficacy of more elaborate and high-performing MCA techniques has not been considered yet in this context. In this paper, we focus on weight-sharing approaches. In particular, the highlights of the paper are as follows:
(i) Two recently proposed weight-sharing techniques [22], [23] are applied to two well-known 3D point-cloud object detection frameworks, namely, PointPillars [9] and PV-RCNN [12].
(ii) The results obtained on the KITTI 3D object detection benchmark [5] reveal considerable acceleration gains, with limited accuracy loss, for both examined models.
(iii) Full-model acceleration of up to 9.2× was achieved in the case of PointPillars, and up to 6.3× for the targeted part (namely, the 2D convolutional layers) of PV-RCNN.
(iv) The relative performance drop of PointPillars ranges from negligible for the class "car" (1−5%) to acceptable for "pedestrian" (18−21%) and "cyclist" (11−16%), across the difficulty levels of the KITTI dataset.
(v) For PV-RCNN, mostly negligible losses were observed across all classes and difficulty levels, with the acceleration even proving beneficial in specific cases.
III. EXAMINED DETECTORS AND ACCELERATION TECHNIQUES

Here, the two object detection schemes that will be considered, namely, PointPillars and PV-RCNN, are briefly presented. The PointPillars network [9] introduces the notion of a Pillar. Based on those Pillars, this network removes the need for 3D convolutions, which have been central to networks like VoxelNet [2] and SECOND [8], by utilizing strictly 2D convolutions, thus achieving both high precision and fast inference.
The architecture of PointPillars consists of three stages, as depicted in Fig. 1(a). The first stage transforms the point cloud into a pseudo-image. By grouping the points of the cloud into vertical columns, called pillars, that are positioned based on a partition of the x−y plane, this stage summarizes the information of the points per pillar into 1D vectors. These vectors are rearranged appropriately to construct the pseudo-image that feeds the next stage. The second stage is a feature-extraction backbone network that provides a high-level representation. This representation is subsequently processed by the third stage, the adopted object detector, which produces 3D bounding boxes and confidence scores for the classes of interest. In terms of computational complexity, the backbone network of the second stage, consisting of a number of 2D convolutions and 2D transposed convolutions, requires more than 95% of the involved operations; thus, we will focus on this stage in the following for its acceleration.
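As a rough illustration of the first stage, the NumPy sketch below scatters LiDAR points into pillars on an x−y grid and summarizes each pillar by the mean of its point features. The mean is a stand-in for the learned PointNet-style encoder of [9], and the grid extents and pillar size are illustrative values, not the actual KITTI configuration:

```python
import numpy as np

def pointpillars_pseudo_image(points, x_range=(0.0, 64.0), y_range=(-32.0, 32.0),
                              pillar_size=0.25, n_features=4):
    """Toy sketch of pillar encoding: scatter points (x, y, z, reflectance)
    into vertical columns on an x-y grid and summarize each pillar by the
    mean of its point features, yielding a dense (C, H, W) pseudo-image.
    The real network uses a learned PointNet encoder instead of the mean."""
    nx = int(round((x_range[1] - x_range[0]) / pillar_size))
    ny = int(round((y_range[1] - y_range[0]) / pillar_size))
    image = np.zeros((n_features, ny, nx), dtype=np.float32)
    counts = np.zeros((ny, nx), dtype=np.int64)
    for p in points:
        ix = int((p[0] - x_range[0]) / pillar_size)
        iy = int((p[1] - y_range[0]) / pillar_size)
        if 0 <= ix < nx and 0 <= iy < ny:
            image[:, iy, ix] += p[:n_features]   # accumulate point features
            counts[iy, ix] += 1
    occupied = counts > 0
    image[:, occupied] /= counts[occupied]       # per-pillar mean
    return image
```

The resulting dense tensor can then be consumed by an ordinary 2D convolutional backbone, which is the key trick that lets PointPillars avoid 3D convolutions.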
The second object detector for point clouds that will be considered in this paper is the recently proposed PV-RCNN [12]. On one hand, this system capitalizes on ideas from grid-based object detectors, which transform the irregular point clouds into regular representations that can be processed efficiently by ordinary convolutional layers, but whose performance is limited by the resolution of the adopted grid. On the other hand, PV-RCNN also exploits ideas from point-based object detection, which operates directly on the points of the cloud, removing the need for point cloud discretization at the cost, however, of increased computational complexity.
In more detail, the architecture of PV-RCNN is depicted in Fig. 1(b). PV-RCNN first transforms the point cloud into voxels, which are subsequently processed by a voxel backbone network consisting of 3D sparse convolutions. Based on its output, 3D region proposals are produced using the BEV backbone network, whose structure is depicted in Fig. 1. PV-RCNN also samples a subset of the points, named key points, and associates summary features with them by appropriately concatenating information extracted from the key points themselves and corresponding information extracted from different layers of the voxel backbone network. The information of the 3D region proposals along with the key-point features is appropriately merged over equispaced grid points defined in each 3D region proposal. This enhanced information is eventually processed to produce the improved final confidence scores for the classes and the 3D bounding boxes. In the following, we will focus on the BEV backbone for studying the impact of acceleration on the performance of PV-RCNN.
Viewing the convolution operation as a summation over dot-products between input and kernel (depth- or channel-wise) vectors, Product Quantization (PQ) aims at reducing the required number of dot-products by limiting the number of allowed representations for the kernel vectors (thus sharing results between them), using a Vector Quantization (VQ) framework.
To be more specific, PQ first obtains a suitable partition of the kernel-vector space into a predefined number of subspaces and then applies VQ in each of them, that is, it estimates a codebook of representatives that "best" approximate the original sub-vectors, according to some appropriate metric. Following this approximation scheme, only the dot-products between the codewords and the input need to be calculated. The obtained results are subsequently shared (substituted) accordingly among the sub-vectors represented by each codeword. It becomes obvious that the size of the used codebook represents a trade-off between the achieved acceleration/compression and the induced approximation error. Typically, codebook design is addressed by applying the k-means algorithm on the original sub-vectors, namely by using the k centroids obtained via k-means as the members of the desired codebook. However, treating the problem in a Dictionary Learning framework can lead to significant improvement over the conventional approach, as was recently shown in works concerning both the acceleration of image classification DNNs (e.g., VGG, ResNet, SqueezeNet) [23], as well as image-based object detectors (SqueezeDet and ResNetDet) [24].
More specifically, the conventional approximation scheme (referred to as VQ hereafter) can be defined as follows:

W ≈ CΓ, (1)

where W and C denote the matrices holding the original sub-vectors (of a particular subspace) and the codebook (cluster centroids), respectively, while the columns of Γ are one-hot vectors indicating the codewords in C. On the other hand, the Dictionary Learning based approach (referred to as DL hereafter) imposes a decomposition of the obtained codebook, as follows:

W ≈ DΛΓ, (2)

where W and Γ are as in (1), D denotes the dictionary and Λ is a sparse matrix, with a hyperparameter controlling the sparsity of its columns. The main advantage of the DL-based codebook design, first presented in [23], lies in its ability to significantly increase the size of the employed codebooks without affecting the achieved acceleration, compared to conventional techniques. This is due to the decomposition of the codebook defined in (2), which decouples the size of the codebook from the number of operations it incurs. To be more specific, the size of the codebook is determined by the size of Λ, whose sparsity limits the number of required operations, while the main bulk of the operations is due to the dense dictionary D, whose size can be controlled separately. This advantage results in a lower quantization error for the same target acceleration, which ultimately translates into better performance of the accelerated network (compared to the VQ approach).
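A minimal NumPy sketch of the VQ scheme in (1), using a plain k-means codebook: only k dot-products with the input are computed, and their results are shared among all N kernel sub-vectors. The DL factorization of (2) is only indicated in a comment, since learning D and Λ requires a sparse-coding solver:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Plain k-means on the rows of X; returns (codebook, assignments)."""
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            if (assign == j).any():
                C[j] = X[assign == j].mean(0)
    return C, assign

# Toy layer: N kernel sub-vectors of dimension d, shared among k codewords.
d, N, k = 8, 256, 16
W = rng.standard_normal((N, d))   # original kernel sub-vectors (rows of W)
C, assign = kmeans(W, k)          # VQ codebook via k-means, as in (1)

x = rng.standard_normal(d)        # one input sub-vector
shared = C @ x                    # only k dot-products are actually computed...
y_fast = shared[assign]           # ...and their results shared among all N kernels
y_full = W @ x                    # reference: N dot-products

# DL-based variant (sketch only): factor the codebook as C ~ D @ Lam with a
# small dense dictionary D and a column-sparse Lam, as in (2), so a much larger
# codebook costs only |D| dense dot-products plus a few sparse combinations.
```

The approximation error of `y_fast` with respect to `y_full` shrinks as k grows, which is exactly the acceleration/error trade-off discussed above.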

IV. EXPERIMENTAL EVALUATION

A. Training and evaluation
Both networks were trained on the KITTI 3D object detection benchmark [5], consisting of 7,481 training images and 7,518 test images, as well as the corresponding point clouds, comprising a total of 80,256 labelled objects. In our study, three classes are examined, namely cars, pedestrians and cyclists, annotated with bounding boxes containing the objects in the 3D scene. 3,716 annotated Velodyne point cloud scenes were used for training and 3,769 annotated Velodyne point cloud scenes were used for testing and validation. For the deployment and retraining of PointPillars and PV-RCNN, the OpenPCDet framework [25] was employed. For the initial evaluation, pre-trained instances were used, while for the retraining, the Adam optimizer was employed with a learning rate of 0.003, a weight decay of 10⁻² and a batch size of 4. Training took place on an NVIDIA GeForce RTX 2080 with 16 GB VRAM and compute capability 7.5. Furthermore, for the PointPillars network the detection accuracy was evaluated on the NVIDIA Jetson TX2, while for the PV-RCNN network, due to its model size, the detection accuracy was evaluated on the NVIDIA GeForce RTX 2080.

B. Acceleration scheme
In our experiments, we apply the VQ and DL weight-sharing techniques to the PointPillars and PV-RCNN models, targeting their convolutional layers, and measure the performance drop induced by the acceleration compared to the original networks. The reported acceleration ratios are defined as the ratio of the original to the accelerated computational complexity, measured by the number of multiply-accumulate (MAC) operations. To achieve our acceleration goal, we followed the stage-wise strategy presented in [22], whereby the individual layers are accelerated progressively in stages, starting from the original network. At each stage, the parameters of one or more layers are quantized using the presented techniques and fixed; subsequently, the remaining layers are re-trained to adapt to the newly introduced changes. The process is then repeated for the convolutional layers involved in the next stage, and so on, until all desired layers are accelerated. The KITTI 3D object detection dataset is employed for the fine-tuning and performance evaluation, ensuring that the same training examples that were used during the initial training are also used during the fine-tuning step. a) Accelerating PointPillars: PointPillars is a fully convolutional network, with its feature-extraction part (both 2D and transposed convolution operators) being responsible for 97.7% of the total MAC operations required. In total, the PointPillars network encompasses 4.835 × 10⁶ parameters and requires 63.835 × 10⁹ MACs. For a good balance between acceleration and performance drop, we targeted the 2D convolutional layers of PointPillars (consuming approximately 47% of the total MACs), as well as the 4 × 4 transposed convolutional layer of the network (responsible for 44.4% of the total MACs), depicted by the red blocks in Fig. 1(a). Acceleration was performed in 16 acceleration stages, with each stage involving the quantization of a particular layer, followed by fine-tuning.
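A framework-free sketch of the stage-wise strategy of [22], under the assumption of one layer per stage; the scalar k-means quantizer and the `fine_tune` stub are placeholders for the actual PQ/DL techniques and re-training loop:

```python
import numpy as np

rng = np.random.default_rng(1)

def vq_quantize(w, k=8, iters=10):
    """Stand-in for the weight-sharing step: scalar k-means over one layer's
    weights, leaving it with at most k distinct values."""
    flat = w.ravel()
    centers = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        a = np.abs(flat[:, None] - centers[None, :]).argmin(1)
        for j in range(k):
            if (a == j).any():
                centers[j] = flat[a == j].mean()
    return centers[a].reshape(w.shape)

def fine_tune(layers, frozen):
    """Placeholder: a real pipeline would run a few training epochs here,
    updating only the layers whose indices are not in `frozen`."""
    return layers

# Toy "network": one weight matrix per targeted convolutional layer.
layers = [rng.standard_normal((4, 4)) for _ in range(3)]
frozen = set()
for idx in range(len(layers)):                 # one layer per acceleration stage
    layers[idx] = vq_quantize(layers[idx])     # quantize this stage's layer...
    frozen.add(idx)                            # ...and keep it fixed from now on
    layers = fine_tune(layers, frozen)         # re-train the remaining layers
```

After the loop, every targeted layer holds only a small set of shared values, while the interleaved fine-tuning lets the still-free layers compensate for the quantization error introduced at each stage.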
Using acceleration ratios of α = 10, 20, 30, and 40 on the targeted layers leads to a reduction of the total required MACs by 82%, 86%, 88%, and 89%, or equivalently, to total model acceleration of PointPillars by 5.6×, 7.6×, 8.6×, and 9.2×, respectively.
b) Accelerating PV-RCNN: The main bulk of the operations required by PV-RCNN is consumed by the Voxel-Backbone and BEV-Backbone blocks shown in Fig. 1(b), with the former composed of Submanifold Sparse 3D-Conv layers [26] and the latter consisting of regular 2D convolutional layers. Since the sparse convolutional layers are already specialized layers designed to exploit the sparsity of the input to reduce their computational complexity, and keeping in mind that the number of operations required by such layers is input-dependent, in this experiment we focused only on the BEV-Backbone block of PV-RCNN, as shown in Fig. 1(b). The PV-RCNN network encompasses 12.405 × 10⁶ parameters and requires 88.878 × 10⁹ MACs, without taking into account the sparse convolutional layers. In this case, the targeted layers (highlighted in Fig. 1(b)) are responsible for roughly 86% of the MACs required by the BEV-Backbone block. Similarly to the previous experiment, using acceleration ratios of α = 10, 20, 30, and 40 on the targeted layers leads to a reduction of the MACs required by the BEV-Backbone block by 77%, 82%, 83%, and 84%, or equivalently, to the block's acceleration by 4.5×, 5.5×, 6.0×, and 6.3×, respectively.
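The relation between the per-layer ratio α and the overall speedup is an Amdahl-style calculation: if a fraction f of the MACs is accelerated by α while the rest is untouched, the overall speedup is 1/((1−f) + f/α). The sketch below reproduces the reported PointPillars figures and matches the BEV-Backbone ones to within rounding of the per-layer MAC fractions:

```python
def overall_speedup(f, alpha):
    """Amdahl-style speedup: a fraction f of the MACs is accelerated by a
    factor alpha, while the remaining (1 - f) is left untouched."""
    return 1.0 / ((1.0 - f) + f / alpha)

# PointPillars: targeted layers consume ~47% + 44.4% of the model's MACs.
f_pointpillars = 0.47 + 0.444
# PV-RCNN: targeted layers consume ~86% of the BEV-Backbone's MACs.
f_bev = 0.86

for alpha in (10, 20, 30, 40):
    print(alpha,
          round(overall_speedup(f_pointpillars, alpha), 1),
          round(overall_speedup(f_bev, alpha), 1))
```

Note that even as α grows without bound, the speedup saturates at 1/(1−f), which explains the diminishing returns between α = 30 and α = 40 in both experiments.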

C. Metrics
The official KITTI evaluation detection metrics include bird's eye view (BEV), 3D, 2D, and average orientation similarity (AOS). The 2D detection is performed in the image plane, and average orientation similarity assesses the average orientation (measured in BEV) similarity of 2D detections [27]. The KITTI dataset is categorized into easy, moderate, and hard difficulties, and the official KITTI leaderboard is ranked by performance on the moderate difficulty. For the sake of self-completeness, easy difficulty refers to a fully visible object with a minimum bounding box height of 40px and a maximum truncation of 15%, moderate difficulty refers to a partially occluded object with a minimum bounding box height of 25px and a maximum truncation of 30%, and hard difficulty refers to a difficult-to-see object with a minimum bounding box height of 25px and a maximum truncation of 50%. Each 3D ground truth box is assigned to one of the three difficulty classes (easy, moderate, hard), and the 40-point Interpolated Average Precision metric is computed separately on each difficulty class. It summarizes the shape of the Precision/Recall curve as

AP|_R = (1/|R|) Σ_{r∈R} ρ_interp(r),

averaging the precision values provided by ρ_interp(r), the maximum precision at recall values greater than or equal to r, according to [28]. In our setting, we employ |R| = 40 equally spaced recall levels.
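The metric above can be implemented in a few lines; the sketch below assumes paired arrays of raw recall/precision values from a detector's ranked output:

```python
import numpy as np

def interpolated_ap(recall, precision, n_points=40):
    """40-point Interpolated AP: AP|_R = (1/|R|) * sum_{r in R} rho_interp(r),
    where rho_interp(r) is the maximum precision at recall >= r."""
    R = np.linspace(1.0 / n_points, 1.0, n_points)  # 40 equally spaced levels
    total = 0.0
    for r in R:
        mask = recall >= r
        total += precision[mask].max() if mask.any() else 0.0
    return total / n_points
```

Taking the maximum precision to the right of each recall level makes the interpolated curve monotonically non-increasing, which removes the "wiggles" of the raw Precision/Recall curve before averaging.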

D. Object detection
In this section, the impact of the VQ and DL acceleration techniques on the performance of PointPillars and PV-RCNN is presented, following the procedure outlined in Sec. IV. Table I summarizes the average precision (AP) for various acceleration ratios in the case of PointPillars. For each category (namely, car, cyclist, pedestrian), the three AP values correspond to the three levels of difficulty (namely, easy, moderate and hard) provided by the evaluation dataset. It is observed, as expected, that the impact on performance grows as the acceleration ratio increases, a tendency also observed for other values of the acceleration ratio that are not depicted here. The relative performance drop is at most 1%, 3% and 5% for the three difficulty levels (easiest to hardest) of "car", while it is at most 18%, 21% and 21% for "pedestrian" and 11%, 15% and 16% for "cyclist". These results are promising for both weight-sharing techniques, as the aforementioned maximum performance drops correspond to a considerable reduction in MAC operations, while the impact is negligible for "car". Note that these acceleration gains can be further enhanced by more tailored MCA configurations involving, e.g., layer-specific compression/acceleration ratios. Finally, an example depicting false-positive errors from the application of the original and the accelerated networks is shown in Fig. 2. For PV-RCNN, the performance of the model remains practically unaffected by the acceleration of the targeted layers, as shown by the results summarized in Table II. This can be attributed to the limited extent of the affected part of the network, but also to the quality of the employed acceleration techniques. Moreover, it is interesting that applying the procedure described in Sec.
IV-B, which involves progressively isolating and fine-tuning specific portions of the network, has even proven beneficial for the overall network's performance, for certain combinations of categories/acceleration ratios.

V. CONCLUSIONS
This work investigated the impact of two recently proposed weight-sharing MCA techniques on the performance of two state-of-the-art 3D object detectors for automotive applications, namely PointPillars and PV-RCNN. Specifically, it investigated the reduction in MAC operations and storage with respect to the deployment of the aforementioned networks on embedded AI computing platforms, and more specifically on the NVIDIA Jetson TX2. The evaluation was performed on the KITTI 3D object detection benchmark and demonstrated significant acceleration gains, while retaining to a great extent the performance of the original networks. As a next step, we are investigating the deployment of these models on Deep Neural Network ASICs.