A comprehensive review of vehicle detection using computer vision

A crucial step in designing intelligent transport systems (ITS) is vehicle detection. The challenges of vehicle detection on urban roads arise from camera position, background variations, occlusion, multiple foreground objects, and vehicle pose. The current study provides a synopsis of state-of-the-art vehicle detection techniques, categorized into motion-based and appearance-based techniques, ranging from frame differencing and background subtraction to the comparatively more complex feature extraction models. The advantages and disadvantages of the techniques are also highlighted, with a conclusion as to the most accurate one for vehicle detection. This is an open access article under the CC BY-SA license.


INTRODUCTION
Recently, video cameras have been extensively used for traffic surveillance applications, as they effectively provide valuable information about traffic flow. Additionally, the usage of video cameras for traffic surveillance is complemented by the incorporation of advanced technology, for example, computer graphics, state-of-the-art cameras, high-end computing capabilities, and automatic video analysis [1]. For road safety and security, detection of vehicle make and model has become necessary for the identification of vehicles on urban roads. Vehicle detection is the key to numerous applications related to surveillance cameras, and it comprises multiple proven techniques, which this paper describes in detail [2,3].
For surveillance systems, one of the most popular types of devices is the visible spectrum camera, which has been extensively used for monitoring environments, events, activities, and people. Different studies in background-foreground segmentation [4][5][6][7] and object detection and classification [8][9][10] have also attempted to automatically analyze image or video data from surveillance cameras.
Another method that is fast gaining popularity, because of the significantly increased road transport levels especially on urban roads, is the intelligent transportation system (ITS). The aim of ITS is to improve driving safety and security, thereby keeping roads flexible even during traffic congestion [2]. This is achieved by applying the principles of computers, electronics and communication to various systems, involving vehicles, drivers, passengers and managers in an inter-communicating framework. ITS utilizes visual appearance to detect, recognize, and track vehicles, which may facilitate the detection of incidents and the analysis of behavior [1,11]. It is also able to retrieve data on traffic flow parameters, such as vehicle class, count and trajectory. A crucial part of ITS involves vehicle detection, with much work being done on the topic, but there are still challenges facing the effective detection of vehicles from the front or rear, illumination and shadows affecting the detection of vehicles, and occlusion issues. Many recent studies have attempted to develop appearance-based techniques to detect vehicles from still images alone. This paper seeks to provide a comprehensive review of the various techniques, particularly those in vehicle detection, that could be available for video-based traffic surveillance from a computer vision perspective.

TELKOMNIKA Telecommun Comput El Control (journal homepage: http://journal.uad.ac.id/index.php/TELKOMNIKA)

VEHICLE DETECTION
The first step in video-based analysis for ITS is vehicle detection [1], which is very important for vehicle recognition, tracking and further processing [12]. Figure 1 shows the main steps of the ITS system. Research in the field is subdivided into two categories: motion-based and appearance-based techniques [13]. Motion-based techniques utilize motion cues to differentiate moving vehicles from the stationary background. Appearance-based techniques utilize appearance features of vehicles, such as color, shape and texture, to isolate the vehicle from the surrounding background scene.

Motion segmentation
Motion segmentation, a crucial first step in many computer vision algorithms, seeks to decompose video into a set of moving objects and the background. Various applications rely on this, including metrology, video surveillance, inspection and robotics, among others. Various studies have addressed the segmentation problem, but performance still falls short of human perception [14,15]. Motion segmentation could be categorized as either background subtraction or optical flow [16].

Background subtraction
Background subtraction, studied since the 1990s, is employed to detect moving objects in videos without prior knowledge. It is mainly employed for video-surveillance applications, as the need to detect people, animals and vehicles arises before more complicated processes such as intrusion detection, tracking and people counting can be undertaken [17,18]. Background subtraction techniques could be categorized as parametric, non-parametric or predictive.
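The core operation shared by all these techniques can be illustrated with a short sketch: a generic thresholded-difference baseline in Python/NumPy (the threshold value and image sizes here are illustrative assumptions, not taken from any cited system):

```python
import numpy as np

def subtract_background(frame, background, threshold=25):
    """Mark pixels whose absolute deviation from the background model
    exceeds the threshold as foreground."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold  # boolean foreground mask

# Toy example: flat background, one bright moving "vehicle" patch.
background = np.full((120, 160), 50, dtype=np.uint8)
frame = background.copy()
frame[40:60, 70:100] = 200
mask = subtract_background(frame, background)
```

The parametric, non-parametric and predictive families discussed below differ mainly in how the `background` model is built and updated, not in this final differencing step.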

Non-parametric background modeling
These techniques employ the pixel history of images to develop probabilistic representations of observations [19]. Examples of this category are kernel density estimation (KDE), the codebook model and the visual background extractor (ViBe). Background modelling, while still difficult to apply in complicated scenarios, is crucial to video surveillance. Difficulties arise from illumination variance and dynamic backgrounds.

Kernel density estimation (KDE)
Kernel density estimation (KDE) is a nonparametric technique which attempts to approximate a density, whereby a known density function (i.e. the kernel) is averaged over the perceived data points, creating smooth estimates [20]. A novel technique for background modeling based on adaptive non-parametric kernel density estimation (AKDE) was proposed in [21]. In this technique, a baseline system is identified to solve the problem of detecting foreground regions in videos with a semi-stationary background. AKDE deals with the multi-modality of the background as well as scene independence. The benefits of AKDE are that no global threshold is required for the pixels in a video scene; different, adaptive thresholds are used for individual pixels. Instead of requiring parameters to be tuned, the system is able to work reliably with various video scenes. Two innovative techniques are also proposed for background modeling based on non-parametric density estimation and recursive modeling. In the novel recursive modeling (RM) method, the model is updated online when a new frame arrives, rather than processing batches of video frames to generate background models. The method is very robust (accuracy of roughly 84%) compared to other techniques, due to its rapid and non-periodic nature. The caveat to the method is that it may fail if no probabilistic model for background or foreground is present.
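The per-pixel scoring underlying KDE-based background models can be sketched as follows. This is a generic Gaussian-kernel sketch; the bandwidth and density threshold are illustrative assumptions, and it is not the AKDE system itself:

```python
import numpy as np

def kde_foreground(samples, pixel, bandwidth=10.0, threshold=0.005):
    """Estimate p(pixel | background) from stored intensity samples using
    a Gaussian kernel; a low density marks the pixel as foreground."""
    samples = np.asarray(samples, dtype=float)
    k = np.exp(-0.5 * ((pixel - samples) / bandwidth) ** 2)
    k /= bandwidth * np.sqrt(2.0 * np.pi)
    return k.mean() < threshold  # True => foreground

# Recent background intensities observed at one pixel.
history = [48, 50, 51, 49, 50, 52, 50, 49]
is_fg_50 = kde_foreground(history, 50)    # consistent with history
is_fg_200 = kde_foreground(history, 200)  # far from all samples
```

Adaptive variants such as AKDE effectively replace the fixed `bandwidth` and `threshold` with per-pixel, data-driven values.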
Another researcher [22] proposed an efficient background subtraction framework dealing with illumination variance at the feature level, which employed local binary pattern (LBP) histograms for background modeling. Furthermore, LBP was extended to a scale invariant local ternary pattern (SILTP) operator. Precise segmentation of dynamic objects could be accomplished by modeling the pixel process with single local patterns. The problem is that local patterns are not ordinal numerical values and cannot be modeled directly with traditional density functions. This led to the proposal of a pattern kernel density estimation (PKDE) to effectively model their probability distributions. On complex-scene data sets, the method was contrasted with current online background subtraction algorithms such as mixture of Gaussians (MoG), the pixel-wise LBP histogram based approach (LBP-P) and the block-wise LBP histogram based approach (LBP-B), achieving accuracy of over 92.40%. The prime disadvantage of the method, however, is its inability to work with color video.
A KDE technique over a joint domain-range representation of image pixels was proposed in [23]. This technique could model spatial relations among observed intensities using a joint domain-range representation of pixels. Using samples from previous frames, background and foreground processes may be modeled with non-parametric density estimation. Joint domain-range KDE represents the background and foreground processes by bringing together three color and two spatial dimensions within a five-dimensional joint space. Using this method, an accuracy of 71.66% was reached, but it bore the disadvantages that foreground priors could not be modeled for complicated objects and that object tracking information was absent.
KDE has also previously been employed to track vehicles in video from a vertically placed camera [24]. Foreground objects are tracked with automatic adaptation to different environments. An evolutionary computing framework is also employed for pose estimation by recovering the pose of vehicles in subsequent frames, with deformable models being fitted to the image data to handle occlusion. Using this method, the accuracy was found to be 88.5%, with the limitations being that vehicles could not be tracked without the shadow removal algorithm and that RGB frames could not be processed directly. KDE also has difficulties when detecting too many objects in the same scene.

Codebook model
The codebook model considers a number of dynamically handled codewords to model background pixels, substituting the parameters of probabilistic functions with pixel textures, pixel color and region experience [25]. The foreground region is detected using clustering of texture information. Color cues from the background are also modeled by the codebook and utilized to refine the texture-based detection results derived from color and texture traits. Using this method, the accuracy was found to be 69%, with the limitation being the detection of foreground regions when both texture and color are the same as in the background scene. In another study, a hierarchical background subtraction algorithm consisting of a block-based stage and a pixel-based stage was proposed [26]. In the block-based stage, apparent backgrounds are detected with a block-based codebook, while maintaining spatial relations among questionable foreground pixels. The accuracy of this work was found to be only 60%, which is low compared to that obtained by other studies. There was also a limitation in detecting foreground in RGB space.
Another method to detect moving vehicles in crowded traffic scenes combined a Gaussian pyramid layered algorithm and a codebook with local binary patterns to construct low-resolution and high-resolution layers of the Gaussian pyramid [27]. Using this method, the accuracy was found to be 69.861%, which is comparatively high. In summary, the codebook technique is more advanced than KDE, especially when dealing with the foreground scene, as it can readily distinguish between background and foreground details. The main limitation is its low accuracy when a scene has a color similar, in RGB space, to the background.
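The codeword-matching idea can be sketched as follows. This is a heavily simplified, hypothetical single-channel version for illustration only; published codebook models additionally store color distortion bounds and access-frequency statistics per codeword:

```python
def codebook_update(codebook, pixel, eps=10):
    """Match the pixel against existing codewords, stored as [low, high]
    intensity ranges; widen the matching codeword, or create a new one.
    Returns True if the pixel matched the background model."""
    for cw in codebook:
        if cw[0] - eps <= pixel <= cw[1] + eps:
            cw[0] = min(cw[0], pixel)  # widen the codeword's range
            cw[1] = max(cw[1], pixel)
            return True
        # no match: fall through to create a new codeword
    codebook.append([pixel, pixel])
    return False

cb = []
codebook_update(cb, 50)            # first observation creates a codeword
matched = codebook_update(cb, 55)  # within tolerance: background
```

During detection, pixels that match no codeword of the learned background model are labeled foreground, which is why regions whose values fall inside an existing codeword range are missed.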

Visual background extractor (ViBe)
The visual background extractor (ViBe) is a robust method employed to extract the background with relatively high accuracy and lower computational cost. Its novelty lies in the random policy used to choose the values upon which the sample-based background estimation is built [28]. A universal sample-based background subtraction algorithm was proposed, which stores a set of previously taken values for each pixel [28,29]. This set is then compared to the current pixel value in order to determine whether that pixel belongs to the background, and the model is adapted accordingly. With its random selection of which background values to substitute, this method departs from classical methods in which the oldest values are replaced first. Table 1 shows the techniques used in non-parametric models along with their accuracy and limitations.

Table 1. Non-parametric background modeling techniques
Method               Accuracy  Limitation
KDE [24]             88.5%     Vehicle cannot be detected without a shadow removal algorithm; cannot work on RGB frames directly.
Codebook model [25]  69%       Fails to detect foreground regions when both their color and texture are similar to those of the background scene.
Codebook model [26]  60%       Fails to achieve high accuracy in RGB color space.
Codebook model [27]  69.861%   Did not consider the foreground scene or multiple objects in the foreground.
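The ViBe scheme described above can be sketched per pixel as follows. This is a simplified sketch with illustrative parameter values; the published algorithm also propagates updates into a random neighbor's model, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

def vibe_step(samples, frame, radius=20, min_matches=2, subsample=16):
    """samples: (N, H, W) stored background values per pixel. A pixel is
    background if it matches at least min_matches samples within radius;
    background pixels then randomly refresh one stored sample."""
    matches = (np.abs(samples.astype(int) - frame.astype(int)) < radius).sum(axis=0)
    background = matches >= min_matches
    # Conservative, time-subsampled update: each background pixel has a
    # 1/subsample chance of overwriting one randomly chosen sample.
    update = background & (rng.integers(0, subsample, frame.shape) == 0)
    idx = rng.integers(0, samples.shape[0], frame.shape)
    ys, xs = np.nonzero(update)
    samples[idx[ys, xs], ys, xs] = frame[ys, xs]
    return ~background  # foreground mask

samples = np.full((20, 10, 10), 50, dtype=np.uint8)  # initialized model
frame = np.full((10, 10), 50, dtype=np.uint8)
frame[2:4, 2:4] = 200                                # new object
fg = vibe_step(samples, frame)
```

The random replacement policy (rather than first-in-first-out) is what gives the stored samples a smoothly decaying lifetime.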

Parametric background modeling
Parametric methods model the pixel intensity (i.e. pixel color) with a probability distribution of known parametric form [30]. A popular choice is the Gaussian distribution, with mean and variance as parameters; when a single Gaussian distribution is used for the background model, its parameters are estimated with temporal averaging or temporal nonlinear filtering [31]. However, these methods are not applicable to complex dynamic scenes in which background pixels vary greatly in color distribution. Table 2 shows the works based on parametric background models.

Frame averaging
In traditional averaging techniques, a set of N frames is summed and divided by the number of frames [32]. The resulting model is then subtracted from subsequent frames, a popular approach used in several studies [32,33]. The accuracy was shown to be 95% in one study [32], which could not handle the effects of shadows, while another study showed an accuracy of 97% while still suffering from the same problem [33].
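A minimal sketch of the averaging model (frame values and the subtraction threshold are illustrative):

```python
import numpy as np

def average_background(frames):
    """Model the background as the per-pixel mean of N frames."""
    return np.mean(np.stack(frames), axis=0)

# Three near-identical empty-road frames; their mean is the background.
frames = [np.full((4, 4), v, dtype=float) for v in (48.0, 50.0, 52.0)]
bg = average_background(frames)                 # every pixel -> 50.0
fg_mask = np.abs(frames[0] - bg) > 25           # subtract from a new frame
```

The shadow sensitivity noted above follows directly from this formulation: a shadow shifts pixel intensities away from the mean just as a vehicle does.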

Single Gaussian
A temporal single Gaussian is used to model the background iteratively, enhancing performance and reducing memory usage. A single Gaussian background model is used for foreground segmentation, with a feature vector based on silhouette measurements calculated for every segmented foreground silhouette [34]. Foreground pixels which are part of a dynamic object are readily detected using an adaptive background subtraction scheme, whereby each background pixel is modeled as an individual Gaussian process with mean µ (time-averaged intensity) and standard deviation σ of the intensity [35].
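The adaptive per-pixel Gaussian scheme can be sketched as follows (the learning rate `alpha` and the k-sigma decision rule are illustrative assumptions; the variance here is updated with the already-updated mean, one of several common variants):

```python
import numpy as np

def update_gaussian(mu, var, frame, alpha=0.05, k=2.5):
    """Per-pixel single Gaussian model: a pixel is foreground if it lies
    more than k standard deviations from the mean; the background
    statistics follow an exponential running average."""
    foreground = np.abs(frame - mu) > k * np.sqrt(var)
    mu = (1 - alpha) * mu + alpha * frame            # running mean
    var = (1 - alpha) * var + alpha * (frame - mu) ** 2  # running variance
    return mu, var, foreground

mu = np.full((2, 2), 50.0)
var = np.full((2, 2), 4.0)
frame = np.array([[50.0, 50.0], [50.0, 200.0]])  # one outlier pixel
mu, var, fg = update_gaussian(mu, var, frame)
```

The low memory footprint comes from keeping only two scalars (µ, σ²) per pixel instead of a sample history.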

Sigma-delta
First shown by Manzanera [36], sigma-delta background estimation compares frames through elementary increments and decrements. Using this method, accuracies as high as 95% have been achieved, with a false alarm rate of 45%. Another study by Toral showed improvements in stability [37], with the enhanced method selectively recalculating background pixels to successfully detect slow-moving vehicles. A further study considered a novel algorithm for queue-parameter estimation by utilizing two background models to estimate temporal stops of vehicles [38]. The method was found to be computationally efficient, with an accuracy of 98%, while its main limitation was in detecting a vehicle when another vehicle passes by the same point.

Gaussian mixture model (GMM)
The Gaussian mixture model (GMM) represents each background pixel with a mixture of Gaussian distributions [39]. In one study, GMM was utilized for background subtraction [40] but yielded noisy images due to false classification. The method was enhanced with a hole filling (HF) algorithm, with a resulting accuracy of 97.9%. Other studies using the same method achieved accuracies of 91.26% [41] and 97.22% [42] (79.63% in heavy traffic conditions). Another study [43] found the extraction speed of this method insufficient, with resulting noisy blocks. In an attempt to solve the issue, a vehicle ROI extraction method based on area-estimation GMM was developed, with results showing a detection speed of 30 frames/sec, tested on 537 vehicles. This result is quite convincing in terms of speed and the ability to filter out noise: the method effectively filters out ambient noise, but the processing speed needs to be increased further and more vehicles should be used during testing.
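As an illustrative sketch of the sigma-delta scheme discussed above (a simplified integer-only version with an assumed amplification factor N, not the exact published estimator):

```python
import numpy as np

def sigma_delta_step(background, variance, frame, N=4):
    """One sigma-delta iteration: the background follows the frame by
    elementary +/-1 increments, and the variance estimate follows N times
    the absolute difference in the same way. Pixels whose difference
    exceeds the variance estimate are labeled foreground."""
    background = background + np.sign(frame - background)
    diff = np.abs(frame - background)
    variance = np.maximum(variance + np.sign(N * diff - variance), 1)
    return background, variance, diff > variance

bg = np.full((3, 3), 50)
var = np.full((3, 3), 1)
frame = np.full((3, 3), 50)
frame[1, 1] = 120                       # a vehicle pixel appears
bg, var, fg = sigma_delta_step(bg, var, frame)
```

Because every update is a +/-1 integer step, the estimator is extremely cheap, which is why it suits embedded traffic sensors; the slow convergence of the +/-1 steps is also why slow-moving or stopped vehicles are eventually absorbed into the background.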

Predictive background modeling
Predictive procedures are employed to model the state dynamics of each pixel. Examples include Kalman filtering [44,45] and eigenspace reconstruction, or eigenbackground [46].

Kalman filter background modeling
The Kalman filter is a recursive estimator which combines the current measurement and the previously estimated state in order to approximate the present state [47]. While this method can approximate background images (with each filter representing one color), the foreground can introduce noise into the filter states, and illumination changes constitute non-Gaussian noise which disrupts the use of Kalman filters. The method is still effective for tracking multiple objects rather than detecting them, as the latter requires separating the background scene from the foreground.
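The background-update step can be sketched with a scalar-gain simplification (an illustrative reduction of the full per-pixel state-space filter; the two gain values are assumptions):

```python
import numpy as np

def kalman_background_update(bg, frame, foreground_mask,
                             g_fg=0.001, g_bg=0.1):
    """Kalman-style recursive correction: the background estimate is
    adjusted by the innovation (frame - prediction), with a much smaller
    gain wherever foreground was detected, so passing vehicles barely
    perturb the model while true background changes are absorbed."""
    gain = np.where(foreground_mask, g_fg, g_bg)
    return bg + gain * (frame - bg)

bg = np.full((2, 2), 50.0)
frame = np.full((2, 2), 60.0)
mask = np.array([[True, False], [False, False]])  # one foreground pixel
bg = kalman_background_update(bg, frame, mask)
```

The gain switching is the essential trick: it is what lets the filter track gradual illumination drift without learning vehicles into the background.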
One study [48] developed a technique for detecting, tracking and counting vehicles using Kalman filtering, where the background was updated with the filter. The accuracy of the method was found to be 97.50%, with the main disadvantage being the difficulty in managing moving background pixels. Another study [49] reported correlation and similarity measures found to be higher than 75%. The limitation of the study was the inability to recognize very small objects, thereby requiring multimodal features to identify objects.

Eigenbackground
Eigenbackground is a robust method for background subtraction, especially in crowded scenes with slow moving foreground objects. The method is also relatively computationally efficient. One study by Oliver et al. investigated background modeling for offline learning and online classification [46]. The mean of sample frames set was computed and subtracted from all frames, after which the covariance matrix was computed and eigenvector matrix composed from the optimal eigenvectors. For classification, the frames are projected onto the eigenspace and back projected onto image space to determine background models.
Using this method, the study reported accuracy of 68.7% for hidden Markov models (HMMs) and 90.9% for coupled hidden Markov models (CHMMs). Another study considered block-level eigenbackgrounds, whereby the original frame was divided into blocks to perform independent block training and subtraction [50]. The study in [51] improved the algorithm further by choosing optimal eigenbackgrounds for individual pixels through selective training, model initialization and pixel-level reconstruction. The authors of [52] proposed another variant of this method targeted at crowded scenes. Using selective mechanisms for background modeling and subtraction, virtual frames were generated as training and update samples for the eigenbackgrounds. This way, optimal background models could be initialized (called selective model initialization) and the best eigenbackground could be chosen.
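The train/project/back-project pipeline described by Oliver et al. can be sketched as follows (a minimal single-component sketch with illustrative frame sizes and threshold; real systems keep several eigenvectors):

```python
import numpy as np

def train_eigenbackground(frames, n_components=1):
    """Subtract the mean frame and keep the top right-singular vectors of
    the stacked training frames as the eigenbackground basis."""
    X = np.stack([f.ravel() for f in frames]).astype(float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def eigen_subtract(frame, mean, basis, threshold=25.0):
    """Project the frame onto the eigenspace, back-project to get its
    static-background explanation, and threshold the residual."""
    x = frame.ravel().astype(float) - mean
    recon = mean + basis.T @ (basis @ x)
    return (np.abs(frame.ravel().astype(float) - recon) > threshold).reshape(frame.shape)

# Training frames: an empty road under slowly drifting global brightness.
frames = [np.full((6, 6), v, dtype=float) for v in (48, 49, 50, 51, 52)]
mean, basis = train_eigenbackground(frames)
frame = np.full((6, 6), 50.0)
frame[0:2, 0:2] = 200.0                 # a vehicle enters
fg = eigen_subtract(frame, mean, basis)
```

Because the eigenspace captures global brightness variation, the reconstruction explains illumination drift but not the vehicle patch, which survives as residual.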
Thus far, all the methods discussed in the background subtraction section show that background modeling is a widely used approach for detecting moving objects from static cameras. It is the general approach to motion detection: a process that takes the difference between the current image and the background image to detect the motion region, and it is generally efficient at delivering object information. Table 3 illustrates the recent predictive techniques, showing the latest methods employed by various studies along with their results. The limitation of the background subtraction technique is still its low accuracy in crowded scenes and its tendency to fail when the foreground contains multiple objects. But the main, and most questionable, disadvantage of background subtraction is that it works only on video frames and cannot work on still images. This is why recently, especially in the last ten years, vehicle detection has moved toward feature-based techniques, due to their ability to work on still images.

Optical flow
Optical flow detects motion by analyzing pixel motion, whether in groups or individually. The concept of optical flow is derived from how human beings observe the motion of objects by capturing and analyzing gradients, brightness, reflected light and so on [53]. The method was proposed by [54], who employed flow vector characteristics of dynamic objects over a period of time to detect dynamic regions in video. The instantaneous pixel speed on the image surface relates to a particular object in 3D space. Results from [54] were reported in terms of the average hit rate, with results of 71.83%, 87.21% and 87.04% for window sizes of 5 by 5, 10 by 10 and 15 by 15, respectively. The main limitation was that some objects were incorrectly detected as vehicles.
Another study considered parallel optical flow based on the Lucas-Kanade algorithm [55]. In this method, the foreground optical flow detector detected the object, with binary computation conducted to define rectangular regions surrounding the detected object. The findings showed a success rate of 98%, which was determined to be accurate for a front-facing camera. The study reported limitations in that vehicles were missed by the optical flow detector in heavy traffic flow conditions, where frames contained too many objects in the same scene.
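A single-window Lucas-Kanade step can be sketched as follows (a minimal sketch of the basic least-squares formulation, without the pyramid or parallel extensions used in the cited works; the window size is an illustrative assumption):

```python
import numpy as np

def lucas_kanade(prev, curr, y, x, win=7):
    """Estimate the optical flow (vy, vx) of the window centred at (y, x)
    by least-squares on the brightness-constancy constraint
    Ix*vx + Iy*vy = -It."""
    h = win // 2
    p = prev.astype(float)
    Iy, Ix = np.gradient(p)              # spatial gradients
    It = curr.astype(float) - p          # temporal gradient
    sl = (slice(y - h, y + h + 1), slice(x - h, x + h + 1))
    A = np.stack([Iy[sl].ravel(), Ix[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    (vy, vx), *_ = np.linalg.lstsq(A, b, rcond=None)
    return vy, vx

# Synthetic test: a smooth blob translated by one pixel in x.
Y, X = np.mgrid[0:33, 0:33]
prev = np.exp(-((Y - 16.0) ** 2 + (X - 16.0) ** 2) / 50.0)
curr = np.exp(-((Y - 16.0) ** 2 + (X - 17.0) ** 2) / 50.0)
vy, vx = lucas_kanade(prev, curr, 16, 16)
```

The pyramid variant mentioned below simply runs this solver coarse-to-fine so that motions larger than the window can still satisfy the linearized constraint.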
The study in [56] proposed a moving vehicle detection algorithm based on optical flow estimation on the edge image. What was unique about the method was the implementation of the pyramid model of Lucas-Kanade optical flow to calculate optical flow information for a feature point set. However, the method still suffered from the same inability to detect vehicles effectively in heavy traffic flow conditions, for the aforementioned reason. The summary of predictive background models is shown in Table 3.

Table 3. Predictive background modeling and optical flow techniques
Method               Result                                                        Limitation
Kalman filter [45]   Tracked a car on a campus; the car came into the scene, then stayed at a carport; test video resolution 384x288.   Small data set.
Kalman filter [48]   97.50%                                                        Difficulties in managing dynamic background pixels.
Kalman filter [49]   MSE between 2.90 and 8.40 (less than 10); PSNR between 84.83 and 95.30; correlation coefficient between 90.6 and 95.0; similarity between 90.11 and 95.44 (all more than 75%).   Inability to recognize very small objects, thus requiring multimodal features to characterize objects.
Eigenbackground [52] The object can be observed in the output image without the background.   Low accuracy and the requirement of further objects interacting in one scene.
Optical flow [53]    98%                                                           Vehicles were missed by the optical flow detector in heavy traffic flow conditions.
Optical flow [56]    Vehicle velocity with high accuracy of about 0.212 cm/sec and a percentage error of 0.986%.   Fails to detect front-view vehicles.

Appearance based techniques
Whereas motion segmentation methods can detect only motion, appearance-based methods seek to detect stationary objects in either images or video by way of their features. Utilization of visual information such as color, texture and shape for vehicle detection still needs prerequisite information [57]. Thus, feature extraction is used to relate 2-D image features to real-world 3-D objects.

Feature based techniques
Feature-based techniques employ coded descriptions of features in order to describe the visual appearance of vehicles. Various features have been used in vehicle detection, including local symmetry edge operators [58]. This method is sensitive to size and illumination variations, thus requiring more spatially invariant features such as edge-based histograms, as formerly applied by [59]. Recently, features have evolved into more advanced ones enabling the detection and classification of vehicles.

Haar-like features
Haar-like features refer to sums and differences of rectangles over an image patch, describing the grey-level distribution of adjacent regions [60]. The filters used to extract features comprise two, three or four rectangles, which can be positioned and scaled accordingly. The output of a filter is computed by summing the pixel values of the grey region and the white region separately, and then normalizing the difference between the two.
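A two-rectangle Haar-like feature can be sketched via the integral image, which is what makes these features cheap at any scale (a sketch omitting the normalization step; the window coordinates are illustrative):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top row and left column, so that
    rectangle sums need no boundary special-casing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of the h-by-w rectangle with top-left corner (y, x), in O(1)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, y, x, h, w):
    """Two-rectangle Haar-like feature: top half minus bottom half."""
    return rect_sum(ii, y, x, h // 2, w) - rect_sum(ii, y + h // 2, x, h // 2, w)

# Dark-over-bright patch, e.g. a windshield above a bright hood.
img = np.zeros((4, 4), dtype=np.int64)
img[2:, :] = 10
ii = integral_image(img)
feat = haar_two_rect_vertical(ii, 0, 0, 4, 4)
```

Once the integral image is built, every such feature costs a constant number of lookups, which is why cascaded Adaboost classifiers can evaluate thousands of them per window in real time.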
These features have been used to detect vehicles from still cameras. Furthermore, statistical data can be computed, including average traffic speed and flow. The framework comprises three main stages, the first of which is detecting the vehicles using Haar-like features. Several studies have trained cascaded Adaboost classifiers [61,62], while others detected highway vehicles [63] and urban vehicles [65]. One study employed LBP features to train boosting classifiers for vehicle detection via license plates [? ], while another used an MLP ensemble in conjunction with Haar-like features [66].
In general, Haar-like features were found to possess high computational efficiency, while being sensitive to vertical, horizontal and symmetric structures, which increases their applicability in practice. Table 4 details the most relevant and recent studies on the topic. One study employed the method for bicycles, concluding that it could still not perform for a greater number of vehicles [62]. Another study [65] found the accuracy of vehicle detection to be 80%, with 120 false alarms.

Table 4. Feature-based techniques
Method                   Accuracy                                                   Gap
SIFT [57]                Matching locations and scales (Match %) = 78.6; match in orientation (Ori %) = 71.8.   Needs work on building models from multiple views that represent the 3D structure of objects.
SIFT [69]                94.13%                                                     Slow processing speed.
SURF [? ]                Average accuracy using FID scheme: 94.94%.                 Fails under occlusion, bad lighting and non-front views.
SURF [70]                A-to-B camera: 72.92%; B-to-C camera: 80.32%.              Low accuracy and slow.
HOG [71]                 Classification accuracy = 92.1%.                           Requires work on a classifier with frame-to-frame tracking to demonstrate the full potential of the algorithm on partially occluded objects.
HOG [72]                 Test accuracy = 93%.                                       Errors in the system due to inaccurate hypothesis generation.
HOG [73]                 Average vehicle detection rate of 97%.                     Calculation costs could potentially be reduced to the range of real-time processing for autonomous driving; fails to detect vehicles in complicated scenes with multiple occluded objects.
Haar-like features       —                                                          Fails under occlusion and with large numbers of vehicles.
Haar-like features [65]  80% accurate detection with 120 false alarms.              Small number of vehicles used for detection.

SIFT (Scale-invariant feature transform)
David G. Lowe proposed the scale-invariant feature transform (SIFT) to transform images into a large number of local feature vectors [57]. Each local feature is invariant to image translation, scaling and rotation, with minor variation across illumination and 3D projection. The main limitation given in the study was the need for multiple views representing the 3D structure. Some works have used simple SIFT features [75, 60? ]. In another study, X. Chen [69] proposed a method for UAVs using SIFT in conjunction with the implicit shape model (ISM), finding a detection accuracy of 94.14% from a training sample size of 1,110. The main limitation was its slow processing speed. Another study [68] utilized a similar SIFT method, resulting in a detection accuracy of 87%, which is still considered relatively low. Thus, the aforementioned studies utilizing SIFT concluded that SIFT is computationally expensive, as the gradients of individual pixels need to be computed.

SURF (Speeded up robust features)
Speeded up robust features (SURF) is a scale and rotation invariant interest point detector and descriptor whose computational complexity, compared to SIFT's, is much lower due to the substitution of the Gaussian filter with box filters, which only slightly affects performance [78]. The algorithm employs a Hessian matrix approximation on the integral image to locate points of interest, with second-order partial derivatives describing local curvatures [79].
The works in [? ] and [? ] both employed this method for vehicle detection, reporting a detection speed of 21 frames/sec. In another study [70], a GPU-based multiple-camera setup was used to match unique representations of vehicles. The accuracy was found to be 94%, with limitations including rotation instability, illumination variations and occlusion. It was suggested that the low accuracy and computational time be improved in future studies.

HOG (Histogram of oriented gradients)
The grid of histograms of oriented gradients (HOG) [80] was initially meant to detect pedestrians but was later extended to vehicles, using a 3-D model surface instead of a 2-D grid of cells to generate a 3-D histogram of oriented gradients [71]. It calculates the image gradient directional histogram, an integrated representation of gradient and edge information. Studies have shown accuracies of 92.1% [71], 93% [72], 94% [73] and 95.26% [74], with the main limitation being that the method needs to track the classifiers frame by frame, particularly for partially occluded objects [81]. Nevertheless, its improved computational efficiency compared to SIFT is noted in the studies [82,83].
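The per-cell accumulation at the heart of HOG can be sketched as follows (a minimal unsigned-gradient sketch without block normalization or bin interpolation; cell size and bin count follow the common 8-pixel/9-bin convention as an assumption):

```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """Compute per-pixel gradient magnitude and unsigned orientation,
    then accumulate magnitudes into per-cell orientation histograms."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.rad2deg(np.arctan2(gy, gx)) % 180      # unsigned, [0, 180)
    bin_idx = np.minimum((ori / (180 / bins)).astype(int), bins - 1)
    H, W = img.shape
    hist = np.zeros((H // cell, W // cell, bins))
    for cy in range(H // cell):
        for cx in range(W // cell):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            for b in range(bins):
                hist[cy, cx, b] = mag[sl][bin_idx[sl] == b].sum()
    return hist

# A vertical edge: all gradient energy falls into the 0-degree bin.
img = np.zeros((8, 8))
img[:, 4:] = 10.0
hist = hog_cell_histograms(img)
```

In a full detector these cell histograms are normalized over overlapping blocks and concatenated into the descriptor fed to a classifier such as a linear SVM.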

DISCUSSION
Most of the studies have dealt with highways and urban roads, and the most popular techniques in recent years were either background subtraction or feature-based techniques, especially SIFT, SURF and HOG. Despite all these techniques, some issues have negatively impacted such systems. Examples of such challenges include camera view and operating conditions, which present additional limitations. ITSs face a number of difficulties, especially with urban road traffic scenes and intersections, in which heavy traffic, vehicle occlusion and camera placement affect the performance of the systems. These challenging issues still require additional research and development, especially in the case of urban roads. In recent years, studies have focused on how to detect vehicles in complex scenes and under occlusion or poor lighting. Researchers have also tried to use huge vehicle datasets in their efforts to achieve high accuracy.

CONCLUSION
This paper has presented different vehicle detection techniques in video-based traffic surveillance and monitoring systems. Vehicle detection falls into two major branches: motion-based and appearance-based techniques. Both can be applied to isolate vehicles from the background scene, with varying computational complexity and detection accuracy. Following the descriptions of the various existing techniques, it can be observed that most methods achieve high levels of accuracy. However, most detection systems still have limitations which negatively impact accuracy, including poor lighting, weather changes, occlusion and shadows. As a future direction for research on vehicle detection, most researchers will focus on detecting vehicles under occlusion and on detecting the front and rear views of vehicles under poor lighting conditions while removing shadows from the scene.