Robust detection method for improving small traffic sign recognition based on spatial pyramid pooling

Traffic sign recognition plays a crucial role in driver guidance and remains an extraordinary challenge for real-world applications. Traffic signs are difficult to detect with a highly precise, real-time approach in practical autonomous driving scenes. This article reviews several object detection methods, including Yolo V3 and Densenet, in conjunction with spatial pyramid pooling (SPP). The SPP principle is employed to boost feature extraction in the Yolo V3 and Densenet backbone networks, and spatial pyramid pooling is adopted to learn object features more completely. These models are measured and compared with key parameters such as mean average precision (mAP), workspace size, detection time, and billions of floating-point operations (BFLOPS). Based on the experimental results, Yolo V3 SPP outperforms state-of-the-art systems. Specifically, Yolo V3 SPP obtains 87.8% accuracy for the small (S) target group, 98.0% for the medium (M) target group, and 98.6% for the large (L) target group in the BTSD dataset. Our results show that Yolo V3 SPP attains the highest total BFLOPS (66.111) and mAP (99.28%). Consequently, SPP improves the performance of all experimental models.


Introduction
Traffic signs play a key role in road traffic networks. They show drivers what may be encountered in the current segment of the road, alert drivers to danger and environmental hazards, and inform drivers of road rules and regulations. Identifying and recognizing road signs is thus an exciting path of scientific research with critical applications in avoiding road collisions and ensuring driver safety (Wang 2018; Wali et al. 2019).
Someday, fully automated systems will correct dangerous driving habits based on sign identification and recognition (Fatmehsari et al. 2010). Traffic sign recognition (TSR) is a complex and challenging task because images contain various difficulties, including occlusion, brightness variation, color alteration, distortion, and skew in the background introduced by the camera setup. A single picture may contain several signs of various colors, sizes, and shapes (Tabernik and Skocaj 2020). Therefore, even basic research on traffic sign detection faces significant difficulties. The key to robust and precise identification of road signs is the recognition of small signs in diverse environments (Zhang et al. 2020a). Extensive work has been carried out on the identification of traffic signs. Color and shape algorithms for identifying a traffic sign (Xu et al. 2019b; Yu et al. 2019c) have been proposed to gather color and shape details and derive features from the region of interest (ROI) containing the traffic signs. Traffic signs are often tiny items that typically occupy small portions of the picture, often less than two percent of the image area (Ou et al. 2019), whereas general objects usually occupy much of the whole image (Chen and Lu 2016; Dewi et al. 2020c). Small traffic signs are difficult to identify accurately. Thus, developing a robust real-time TSR is a difficult task, given the latency at test time and the need to identify small targets.
This research paper analyzes and examines four convolutional neural network (CNN) models for traffic sign identification, together with SPP to optimize the identification of small signs. We fine-tune them on the Belgian traffic signs dataset (BTSD) to perform traffic sign detection and recognition. To the best of our knowledge, no prior research analyzes multiple deep-learning-based object detectors while focusing primarily on the traffic sign detection and recognition problem, and no prior research measures the many significant variables, including mAP, IoU, and detection time, in detail.
The paper's principal contributions are: (1) We offer a brief review of CNN object detection and recognition methods, particularly Yolo V3, Yolo V3 SPP, Densenet, and Densenet SPP. (2) We review and assess diverse state-of-the-art object detectors, primarily those intended to improve small traffic sign detection and recognition. The assessment of these models involves fundamental parameters including mAP, detection time, IoU, total BFLOPS, and workspace size. (3) We employ spatial pyramid pooling to collect local region features at different sizes in the same layer to learn multiscale object features more comprehensively. (4) This work applies an image data preprocessing step, including mask, blur, binary, close, erode, and dilate operations. Findings demonstrate that Yolo V3 SPP is the most accurate and that SPP enhances the performance of all experimental models. We perform a detailed series of experiments to validate Yolo V3 SPP's efficiency for different target sizes. Experiments demonstrate that Yolo V3 SPP outperforms state-of-the-art schemes, reaching an accuracy of 87.8% for small (S) targets (0 < S < 32 × 32 pixels), 98.0% for medium (M) targets (32 × 32 ≤ M < 96 × 96 pixels), and 98.6% for large (L) targets (L ≥ 96 × 96 pixels) in the BTSD dataset. Furthermore, Yolo V3 SPP leads the state-of-the-art AN + FRPN (Yang et al. 2018) by 12.49%.
The remainder of this work is organized as follows. The materials and processes are presented in Sect. 2. The experiment results are briefly explained in Sect. 3. Section 4 analyzes, contrasts, and describes the findings obtained for traffic sign identification. Finally, Sect. 5 offers conclusions and points the way to potential future research.

Object recognition with CNN
Deep convolutional neural networks (CNNs) have recently been applied to object recognition and detection (Krizhevsky et al. 2017a). In 2012, Krizhevsky et al. confirmed CNN's capability to significantly enhance image classification precision in the ImageNet Large-Scale Visual Recognition Challenge (Krizhevsky et al. 2017b). Initially, CNNs were deepened and widened for better precision, while algorithms have become smaller and more efficient in recent years. Recent deep learning algorithms, specifically those incorporating CNNs, such as You Only Look Once (Yolo) V3, have shown great power in high-precision target detection tasks (Feng et al. 2020).
Spatial pyramid pooling (SPP) networks (Dewi et al. 2021a), initially advanced by Microsoft Research Asia, first map candidate regions onto the final convolutional feature map and measure their locations. Next, a pooling layer based on the SPP algorithm reduces the dimensions, and finally a feature layer of a fixed size is obtained. SPP was introduced to improve R-CNN performance by sharing computation: SPP computes the feature maps of the whole input image only once and then pools arbitrarily sized regions to produce fixed-length representations for training detectors, thereby avoiding repeated computation of the CNN feature maps. However, SPP requires training in a multi-stage pipeline, because the vectors delivered by the SPP layers are further passed to fully connected layers, so the method as a whole is slow. By contrast, Yolo (Liu et al. 2016) encapsulates computation in a single full CNN rather than producing a series of region proposals, which yields a considerably faster detector (Zhao et al. 2019). Yolo V3 (Redmon and Farhadi 2018) uses Darknet-53 as the backbone network, replacing Darknet-19, and uses multiscale estimation (Bangquan and Xiong 2019).

Spatial pyramid pooling (SPP) network
In object detection and recognition, SPP (Basbug and Sert 2019; Grauman and Darrell 2005; Lazebnik et al. 2006) has been surprisingly effective. Despite its simplicity, SPP is competitive with methods that use more complex spatial paradigms. At every stage of the pyramid, the spatial pyramid representation divides the picture into an ever-finer grid sequence (Tai et al. 2020). SPP, also popularized as spatial pyramid matching (SPM) (Sivic and Zisserman 2003), is one of computer vision's main extensions of the Bag-of-Words (BoW) paradigm. SPP was a vital component of leading and efficient systems for classification (Yang et al. 2009; Wang et al. 2010) and detection (Van De Sande et al. 2011), long before CNN's recent growth.
SPP has several advantages. First, SPP produces a fixed-length output regardless of the input image size, and it supports arbitrary scales and aspect ratios: the picture can be resized to any scale and fed to the same deep network. Second, SPP pools over multiscale spatial bins, whereas sliding-window pooling uses just one window size. Further, SPP allows us to test on images of arbitrary size and to supply images of varying dimensions during training; training with images of several sizes helps reduce overfitting. SPP is thus a flexible solution for handling different aspect ratios, scales, dimensions, and sizes (Fu et al. 2020). These results are significant in visual recognition but have gained little attention in deep network research (Xu et al. 2019a). Figure 1 presents a network configuration with an SPP layer; the SPP block is inserted into our Yolo V3 and Densenet configuration files.
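To make the multiscale pooling concrete, the sketch below mimics an SPP block of the kind used in Yolo V3-style configurations: stride-1 max pools with several kernel sizes are applied in parallel and concatenated with the input along the channel axis. The kernel sizes (5, 9, 13) are the common Yolo V3 SPP choice, assumed here for illustration; this is a minimal NumPy version, not the Darknet implementation.

```python
import numpy as np

def max_pool_same(fmap, k):
    """Max-pool a (C, H, W) feature map with kernel k, stride 1, 'same' padding."""
    pad = k // 2
    c, h, w = fmap.shape
    padded = np.pad(fmap, ((0, 0), (pad, pad), (pad, pad)),
                    mode="constant", constant_values=-np.inf)
    out = np.empty_like(fmap)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = padded[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp_block(fmap, kernels=(5, 9, 13)):
    """SPP block: concatenate the input with its max-pools at several scales."""
    pooled = [fmap] + [max_pool_same(fmap, k) for k in kernels]
    return np.concatenate(pooled, axis=0)
```

With the default kernels the channel count quadruples (e.g. a 512 × 13 × 13 map becomes 2048 × 13 × 13) while the spatial resolution is preserved, which is what allows the block to be dropped into an existing backbone.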

Yolo V3
The Yolo algorithm (Redmon and Farhadi 2017) is a typical end-to-end, one-stage network structure. This construction is more compact than the two-stage networks of the R-CNN series (Xu et al. 2020), which first produce candidate regions and then perform detection on them. Yolo V3 divides the feature map into an N × N grid (Chen et al. 2019b), and each grid cell generates B bounding boxes that predict the target; the resulting N × N × B predictions cover the entire feature map, and border regression is performed directly on the generated bounding boxes. Yolo V3 was introduced by Redmon and Farhadi (2018) in 2018. It merges the candidate-region mechanism and detection into the same network, making detection and recognition faster than the R-CNN series.
Moreover, Yolo V3 applies multiscale fusion for prediction and uses a single CNN to process the entire image. Anchor boxes are used for estimating bounding boxes, and the K-means algorithm is applied to perform dimension clustering on the target boxes, yielding 9 anchor boxes of various sizes. Yolo V3 assigns a single bounding box anchor to each ground-truth object: a grid cell detects an object if the center point of the object's ground truth falls inside that cell. Figure 2 and Eqs. (1)-(4) describe the bounding boxes with location prediction and dimension priors:

b_x = σ(t_x) + c_x (1)
b_y = σ(t_y) + c_y (2)
b_w = p_w e^{t_w} (3)
b_h = p_h e^{t_h} (4)

where b_x, b_y, b_w, b_h are the x, y center coordinates, width, and height of the prediction; t_x, t_y, t_w, and t_h are the network outputs; c_x and c_y are the top-left coordinates of the grid cell; and p_w and p_h are the anchor box dimensions (Redmon and Farhadi 2017).
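The four prediction equations can be written as a small helper, a sketch using the notation above (the function names are ours):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs (t_*) into a box, per Eqs. (1)-(4)."""
    bx = sigmoid(tx) + cx      # Eq. (1): x center, sigmoid keeps the offset in the cell
    by = sigmoid(ty) + cy      # Eq. (2): y center
    bw = pw * math.exp(tw)     # Eq. (3): width scales the anchor prior
    bh = ph * math.exp(th)     # Eq. (4): height scales the anchor prior
    return bx, by, bw, bh
```

With all-zero network outputs the box sits at the center of its grid cell with exactly the anchor's dimensions, which shows why good anchor priors matter.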

Fig. 2 Bounding boxes
Yolo V3 uses K-means with 9 clusters to calculate the anchor boxes (Chen et al. 2019b; Dewi et al. 2020a). The K-means algorithm performs dimension clustering so that anchor boxes and adjacent ground truths have higher IoU values, which is not directly associated with the size of the anchor boxes. Algorithm 1 demonstrates the K-means clustering procedure for the anchor boxes. DenseNet (Huang et al. 2017; Wang et al. 2019) was introduced in 2017. Feedforward connections link each layer to all other layers: the feature maps of all preceding layers are used as input for each layer, and its own feature maps are used as inputs by all subsequent layers. Densenet has more than 40 layers and a far greater level of convergence (Huang and Pun 2019). To minimize feature duplication in the network design and maximize feature reuse (Yu et al. 2019a), Densenet allows a convolution layer to accept extra feature channels from the same and earlier levels. Densenet thus has useful advantages: it facilitates feature reuse and alleviates the vanishing gradient problem (Yu et al. 2019b; Dewi et al. 2020b).
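A minimal sketch of the anchor clustering described above, using the IoU of width/height pairs (boxes aligned at the origin) as the similarity measure, which is the usual choice for Yolo-style anchors; the helper names and the simple mean-update are our illustrative assumptions, not the paper's exact Algorithm 1.

```python
import random

def iou_wh(box, centroid):
    """IoU of two (w, h) boxes aligned at the origin (width/height only)."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) ground-truth boxes into k anchor priors."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        # assign each box to the centroid with the highest IoU
        clusters = [[] for _ in range(k)]
        for b in boxes:
            idx = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[idx].append(b)
        # update each centroid to the mean width/height of its cluster
        centroids = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)
```

Using 1 − IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which is exactly why the anchors end up well matched to ground truths of all sizes.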

Densenet
Densenet consists primarily of dense blocks, transition layers, and a growth rate (Ghatwary et al. 2019; Dewi et al. 2021b), with X dense blocks in total (Huang et al. 2020b). There are m layers within any dense block, in which each layer is linked to all consecutive layers in a feed-forward way (Huang et al. 2017) (Zeng and Xiao 2019). Hence, the l-th layer obtains the feature maps of all earlier layers as input, x_l = H_l([x_0, x_1, ..., x_{l−1}]), where H_l denotes the composite function of this layer and [·] is the concatenation of the feature maps.
The transition layer sits between dense blocks and reduces the spatial dimension of the feature maps; it consists of a 1 × 1 convolution layer and a 2 × 2 average pooling layer. Each connection function outputs f feature maps, as shown in Eq. (5), so the input of the m-th layer has f × (m − 1) + f_0 channels, where f_0 is the number of channels of the original input image. To increase parameter efficiency and manage the growth of the network, f is restricted to a growth rate G with a small integer value; this variable controls the amount of new information each layer contributes. Figure 3 represents the Densenet architecture in this experiment.
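The dense connectivity and the transition pooling can be sketched with NumPy as follows. The composite function H_l is passed in as an arbitrary callable, and the 1 × 1 convolution of the transition layer is omitted for brevity; this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def dense_block(x0, m, h):
    """x0: (f0, H, W) input; h: composite function H_l mapping (C, H, W) -> (G, H, W)."""
    features = [x0]
    for _ in range(m):
        x = np.concatenate(features, axis=0)  # concatenate all preceding feature maps
        features.append(h(x))                 # each layer adds G new channels
    return np.concatenate(features, axis=0)

def transition(x):
    """Transition layer: 2 x 2 average pooling, stride 2 (1 x 1 conv omitted)."""
    c, hgt, wdt = x.shape
    x = x[:, :hgt // 2 * 2, :wdt // 2 * 2]    # crop odd borders
    return x.reshape(c, hgt // 2, 2, wdt // 2, 2).mean(axis=(2, 4))
```

With growth rate G, the block output carries f_0 + m·G channels, matching the f × (m − 1) + f_0 count given above for the input of the m-th layer.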
Apart from small traffic sign detection, similar techniques can be applied in several other promising scenarios. One example combines a Bloom filter with an exponential histogram to query streams in a sliding window and identify heavy hitters; this method is called EBF sketches. The sketch structure allows effective summarization of streams over time-based sliding windows with guaranteed probabilistic accuracy and can address obstacles such as managing frequency statistics and finding heavy hitters. Li et al. (2020) present a new deep learning architecture for regional epitaxial traffic flow prediction called the generative adversarial capsule network (GACNet), which predicts the traffic flow of surrounding areas based on inflow and outflow information of the central area. The method is data-driven, and the spatial relationship of traffic flow is described by dynamically transforming traffic information into images through a two-dimensional matrix.
Other research proposes an efficient algorithm for improving the flattening result of triangular mesh surface patches having a convex shape (Yavuz et al. 2019). The dynamic virtual boundary approach is used to reduce the distortions for the triangles near the boundary caused by the nature of the convex combination technique.

Yolo V3 SPP method
In the following sections, we explain our proposed methodology for recognizing the BTSD dataset using spatial pyramid pooling based on Yolo V3. Figure 4 describes our color-based detection Yolo V3 SPP architecture. The original image size in the BTSD dataset is 1628 × 1236. This work applies an image data preprocessing step including mask, blur, binary, close, erode, and dilate operations (Fig. 5).
In the detection process we use a red- and blue-color-based detection method to detect the sign. Erosion erodes the boundaries of the foreground object and reduces image features: it removes pixels from the edges of objects in a binary image. Dilation increases the area of the object and is used to highlight features. The benefits of erosion and dilation include: (1) Erosion helps remove small white noise and can detach two connected objects. (2) Erosion followed by dilation is useful for noise elimination, since erosion removes white noise but also shrinks the object; dilating afterwards restores the object area once the noise has been removed. Dilation is also valuable for joining broken parts of an object. The dilation process is shown in Eq. (6) and the erosion operation is given in Eq. (7) (Fu et al. 2015).
The basic operations of erosion and dilation are widely used in image processing. The erosion of A by B is defined as A ⊖ B = {z | (B)_z ⊆ A}, where B is the structuring element; erosion tightens the image and controls its structure to some extent. When A is dilated by B, a new set is built from all displacements at which the reflected structuring element overlaps A: after mapping and moving, B and part of A overlap, giving A ⊕ B = {z | (B̂)_z ∩ A ≠ ∅}. Dilation is performed as described in Eqs. (8-9).
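A pure-NumPy sketch of binary erosion and dilation with a square k × k structuring element makes the definitions concrete. In practice OpenCV's cv2.erode/cv2.dilate would be used; this illustrative version trades speed for clarity.

```python
import numpy as np

def erode(img, k=3):
    """Binary erosion: a pixel stays 1 only if the whole k x k neighborhood is 1."""
    pad = k // 2
    p = np.pad(img, pad, mode="constant", constant_values=0)
    h, w = img.shape
    out = np.zeros_like(img)
    for i in range(h):
        for j in range(w):
            out[i, j] = p[i:i + k, j:j + k].min()
    return out

def dilate(img, k=3):
    """Binary dilation: a pixel becomes 1 if any pixel in the k x k window is 1."""
    pad = k // 2
    p = np.pad(img, pad, mode="constant", constant_values=0)
    h, w = img.shape
    out = np.zeros_like(img)
    for i in range(h):
        for j in range(w):
            out[i, j] = p[i:i + k, j:j + k].max()
    return out
```

Applying erode followed by dilate (a morphological opening) removes isolated noise pixels while preserving larger shapes, which is exactly the noise-elimination behavior described above.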
The Yolo V3 SPP detection method is explained in Algorithm 2. The BTSD dataset with 62 classes is used as input to the object detection method. The algorithm proceeds through the following phases: (1) Bounding boxes restrict the detected targets. (2) Objects of the same class are related, and the same target in each image is given the same label. (3) Identical images assign an identical target a uniform label. (4) Non-maximum suppression (NMS) is applied to search comprehensively, delete similar boxes, and report the object detection accuracy.
The core procedure of the NMS algorithm is as follows. Step (1): we pick the bounding box with the greatest confidence as the identification target, then compute the IoU between this box and each remaining box. Step (2): we discard any remaining bounding box whose IoU with the chosen box exceeds the threshold. Step (3): we choose the bounding box with the next-highest confidence as the comparison target and repeat Steps (1) and (2) until all bounding boxes are processed. In our research, Yolo V3 uses a spatial model to capture the essential features with max-pooling layers applied to samples of the convolutional layers. It employs [route] layers to combine max pools at three different scales for each image; in each route, the layers − 2, − 4 and − 1, − 3, − 5, − 6 are alternated. Yolo V3 SPP is run in a single process for BTSD dataset identification and recognition.

Belgian traffic signs dataset (BTSD)
In this work we used the Belgian traffic signs dataset (BTSD) (Timofte et al. 2011, 2014; Liu et al. 2019; Mathias et al. 2013) to evaluate the proposed method, as shown in Fig. 6. Each image is 1628 × 1236, with a total of 9007 images in the dataset. This research experiment arranged the traffic sign targets into three levels according to size in pixels as follows: (1) Small (S) targets whose areas are smaller than 32 × 32; (2) Medium (M) targets whose areas are within 32 × 32 and 96 × 96; (3) Large (L) targets whose areas exceed 96 × 96. Complete statistics for the traffic sign target sizes are given in Table 1 (Yang et al. 2018).
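The S/M/L grouping can be expressed as a tiny helper, assuming (as the text above states) that the group boundaries are compared by area in pixels; the function name is ours.

```python
def size_group(width, height):
    """Assign a traffic-sign target to the S/M/L groups used for BTSD statistics."""
    area = width * height
    if area < 32 * 32:       # smaller than 32 x 32 pixels
        return "S"
    if area < 96 * 96:       # between 32 x 32 and 96 x 96 pixels
        return "M"
    return "L"               # exceeds 96 x 96 pixels
```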

Training results
The training procedure generated additional data from the primary images using simple geometric transformations, including horizontal and vertical flips, translations, rotations, and scale changes. These techniques are often used in the training of large neural networks. The experiment further applies several augmentation parameters: rotation range, zoom range, width shift range, and height shift range. The convolutional layers are initialized with parameters pretrained on ImageNet, and the remainder are initialized from a random Gaussian distribution. This work trains the model with a learning rate of 0.001, a momentum of 0.9, and a learning rate decay of 0.1 at each epoch. An image processing server with NVidia 2080TI graphics cards was used to train and test the proposed traffic sign detection model. During training, a loss function estimates the error between the true and predicted values. The Yolo loss (Eq. (10)) is the sum of the coordinate, IoU, and classification errors, Loss = Err_coord + Err_IoU + Err_cls, where λ_coord is the weight of Err_coord; x̂, ŷ represent the predicted center coordinates, and ŵ, ĥ, ĉ, and p̂ the predicted width, height, confidence, and class probability. B denotes the number of bounding boxes each grid cell predicts; I^obj_ij indicates that the object falls inside the j-th bounding box of the i-th grid cell, and I^noobj_ij indicates that there is no target in this bounding box. Err_coord is the error in the coordinates. In the experiment we set λ_coord to 0.5, so the width and height errors have only a light effect on the prediction calculation; the coordinate error is computed only when the grid cell predicts an object.
Here λ_noobj is the weight of the IoU error; ĉ_i and c_i denote the predicted confidence and true confidence, respectively. λ_noobj = 0.5 is introduced to weaken the influence of the large number of grid cells without objects on the loss value. The classification error Err_cls is defined in Eq. (13).
Err_cls is the error in the classification; cross-entropy is used to measure this loss, and it operates only at grid cells containing objects. The sigmoid function is applied in Yolo V3 as the activation for class prediction; when two labels belong to the same target, the sigmoid function solves the problem more effectively than Softmax (Fang et al. 2020; Dewi et al. 2021c). Figure 7 shows the reliability of the Yolo V3 (a) and Yolo V3 SPP (b) training phases. The loss value for each model is 0.0083 at 74,400 iterations and 0.0129 at 41,500 iterations, respectively. The experiment uses 124,000 iterations as the maximum number of batches with a step policy, scales (0.1, 0.1), and steps (99,200, 111,600). Figure 8a shows the consistency of the training process with Densenet. The training stage remains constant after 9000 epochs and stops at 45,000 epochs. Densenet applies max_batches = 45,000 and mask_scale = 1, and the training loss value reaches 0.0057. Furthermore, Densenet SPP uses max_batches = 45,000 and mask_scale = 1; its iterations remain steady from 9000 epochs and finish at 36,400 epochs with a loss value of 0.0041 (Fig. 8b).
The system starts with no information and uses a high learning rate at the beginning of the training process. As the CNN layers process an increasing data volume, the weights are adjusted gradually; in the configuration file, this reduction is specified by a step learning-rate policy. Throughout training we monitor various error indicators and stop training when the average loss (avg loss) no longer decreases. Table 2 outlines the complete training process, including several evaluation indices such as precision, recall, F1, loss value, mAP, AP, and IoU on the BTSD dataset. Yolo V3 SPP had the greatest mAP, around 99.28% with IoU at 89.48%, followed by Yolo V3 at 98.96% with IoU at 88.54%, Densenet SPP at 99.1% with IoU at 88.12%, and Densenet at 98.9% with IoU at 85.72%. Therefore, SPP can be combined with and reinforce any of these models. Densenet, with a mAP of 98.9% and an IoU of 85.72%, offered the poorest performance of any model in the experiment; SPP boosts its efficiency, yielding a mAP of 99.1% and an IoU of 88.12% (a 2.4% increase) with Densenet SPP. The findings can be subdivided into three types: true positives (TP), the number of correctly detected samples; false positives (FP), the number of wrongly detected samples; and false negatives (FN), the number of unrecognized samples. Precision (P) and recall (R) are defined as P = TP/(TP + FP) and R = TP/(TP + FN).
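From the TP/FP/FN counts, the precision, recall, and F1 values reported in Table 2 follow directly; a minimal sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp)   # fraction of detections that are correct
    recall = tp / (tp + fn)      # fraction of ground-truth signs that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```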
The intersection over union (IoU) is another criterion used to evaluate detection accuracy. It is obtained by calculating the overlap ratio between the predicted and ground-truth bounding boxes (Liu et al. 2018). IoU is scale-invariant and nonnegative, and is a general measure for evaluating targets, also known as detection accuracy. The IoU value expresses the relationship between the detection result and the ground truth (Huang et al. 2020a); it measures the overlap ratio as shown in Eq. (17) and Fig. 9 (Arcos-García et al. 2018).

Discussions
In this stage, we randomly use 771 images from the BTSD dataset for evaluation, with different sizes (S, M, L) and circumstances. The precision and time analyses of the experiments are shown in Fig. 10. We calculate the detection time for each image in our test dataset and compute the average detection time for each algorithm. Generally, Yolo V3 SPP exhibits better accuracy than the other models, obtaining 87.8% accuracy for small signs. Convolution and max-pooling have different advantages; therefore, convolutional subsampling in the corresponding sample layers may best be combined with pooling. Each [route] uses different layers, − 2, − 4 and − 1, − 3, − 5, − 6, in conv5. Max pooling selects only the maximum values from neighboring regions, deleting noise from the image. Through this mix, the backbone networks of both Yolo V3 and Densenet benefit from SPP.
A comparison of detection accuracy using our approach and approaches used in previous research is given in Table 3. Previous research (Yang et al. 2018) uses Faster R-CNN to detect the BTSD dataset and obtains 43.93% accuracy for small targets, 88.05% for medium targets, and 96.82% for large targets. The AN + FRPN (attention network + fine region proposal network) reaches 50.82% accuracy for small targets, 97.8% for medium targets, and 98.31% for large targets. Our proposed method outperforms the other methods with 87.8% accuracy for small targets, 98.0% for medium targets, and 98.6% for large targets. Figure 11 presents the recognition results for Class D7 using all models. The models shown in Fig. 11a, b, and d recognize sign D7 well; however, Densenet, shown in Fig. 11c, fails to distinguish the sign. The highest accuracy obtained by Yolo V3 SPP in Fig. 11b is 97.24%, with a size of S and a detection time of 0.5256 s. Furthermore, Fig. 12 illustrates the recognition results for Class E1. The results in Fig. 12 indicate that every model recognizes class E1 well, with a range of bounding box coordinates, times, and accuracies. Yolo V3 SPP identifies sign E1 with an accuracy of 99.94%, a size of S, and a detection time of 0.3313 s (Fig. 12b). Figure 13 shows the total BFLOPS, workspace size, and number of layers for all models in the tests. Yolo V3 provided a total BFLOPS of 65.773 and allocated extra workspace. Detection examples are shown in Fig. 14; the localization precision of Yolo V3 SPP (Fig. 14b) is higher than that of the other models, and Yolo V3 SPP identifies both signs in the image. Sign F49 was detected in 0.2885 s with 94.10% accuracy and size M, while sign D7 required 0.2269 s to be recognized with 93.92% accuracy and size S. In Fig. 15c, the algorithms exhibited false and missed detections; however, for the last image in Fig. 15d, Densenet SPP detects all signs, with an accuracy lower than that of Yolo V3 SPP.
The results of the experiments indicate that SPP increases the effectiveness of recognition and interpretation on the BTSD dataset, boosting the performance of the Yolo V3 and Densenet backbone networks.
While SPP takes more time, it makes detecting multiple signs per image easier. As shown in Figs. 14b and 15b, Yolo V3 SPP can identify every sign in the image. Yolo V3 and Yolo V3 SPP load only 107 and 114 layers, respectively, fewer than the other models. Densenet SPP has the highest number of layers, around 312, and a large workspace (104.86 MB).

Conclusions
This study presents a preliminary identification and analysis of four traffic sign models based on deep neural networks. The performance of all detectors, including accuracy, workspace size, detection time, and the number of floating-point operations through the CNN, is analyzed and discussed. This work applies SPP to the Yolo V3 and Densenet backbone networks; we use SPP to pool local regions at different scales in the same convolutional layer and thus learn multiscale object features more comprehensively. The results of our experiments demonstrate that SPP can improve the efficacy of identifying and understanding the sign. Moreover, the experiments show that Yolo V3 SPP is superior to the most advanced sign detection systems, achieving 87.8% accuracy for small (S) targets (0 < S < 32 × 32 pixels), 98.0% for medium (M) targets (32 × 32 ≤ M < 96 × 96 pixels), and 98.6% for large (L) targets (L ≥ 96 × 96 pixels) on the BTSD dataset.
We will develop a dataset of all traffic signs in Taiwan in a future study. We will combine Yolo V4's optimal speed and accuracy of object detection with Explainable AI (XAI) to achieve better performance and results. Furthermore, our future work will explore the maximum detection distance of the sign and detection rate according to various distances and speeds.