Robust Tracking Using On-Line Selection of Multiple Features

This paper presents a novel on-line feature selection mechanism based on the mean shift tracking algorithm, which adjusts the weight of each feature and each bin in the feature histograms during tracking, according to how well each feature discriminates between the appearance of the object and the background. Tracking then becomes stable and reliable by combining the features with these weights. As the appearance model we select gray level, Local Binary Pattern (LBP) texture and edge orientation features, and supplement the feature space with gradient amplitude. Experiments on two video sequences show the effectiveness of the proposed method.


INTRODUCTION
Real-time object tracking is an active research topic in many computer vision applications, such as surveillance and human-computer interaction. One of the main factors limiting tracking performance is a lack of robustness.
Gray level [1,2], LBP [3,4,5,6] and edge orientation [7,8] features have each been used individually for mean shift tracking. However, a target represented by a single feature may look similar to the background in complex environments, which makes tracking unstable. Using a "center-surround" approach to sample pixels from object and background, Collins et al. [9] back-project the log-likelihood ratio between foreground and background into a weight image, assigning high weights to pixels that fall into discriminative histogram bins. However, weights obtained from the log-likelihood ratio function are not always reliable; sometimes high-weight foreground pixels still look like the background, as Fig. 1(b) shows. This is because the histogram of a feature may be multimodal, two successive bins may not be independent, and a peak is more stable than a single bin. A Gaussian Mixture Model (GMM) is proposed in [10], from which the stable Gaussian component of the object model can be obtained, as shown in Fig. 1(c).
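The "center-surround" back-projection described above can be sketched as follows. This is a minimal illustration with NumPy; the bin count and the floor constant `delta` are assumed values for the sketch, not taken from [9]:

```python
import numpy as np

def loglik_weight_image(gray, fg_mask, bg_mask, bins=32, delta=1e-3):
    """Back-project the foreground/background log-likelihood ratio of a
    gray-level histogram into a per-pixel weight image."""
    edges = np.linspace(0, 256, bins + 1)
    h_fg, _ = np.histogram(gray[fg_mask], bins=edges, density=True)
    h_bg, _ = np.histogram(gray[bg_mask], bins=edges, density=True)
    # L_u = log(max(h_fg, delta) / max(h_bg, delta)); discriminative bins
    # get large positive (foreground-like) or negative (background-like) values
    L = np.log(np.maximum(h_fg, delta) / np.maximum(h_bg, delta))
    # map every pixel to its bin's log-likelihood ratio
    idx = np.clip(gray.astype(int) // (256 // bins), 0, bins - 1)
    return L[idx]
```

Pixels whose bin is dominated by the foreground receive positive weights, background-dominated bins receive negative weights, exactly the weight image of Fig. 1(b).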
As the target moves from place to place, both foreground and background appearance change, and the features corresponding to the changed appearance may become ineffective while the others remain valid. To improve robustness, we use gray level, LBP, edge orientation and gradient amplitude features to characterize the color, texture and structure of the target. However, once the background changes, the discrimination between object and background also changes for each feature. To keep tracking stable, it is therefore necessary to determine on line which features are discriminative, and to what degree. In this paper, we first compute the histograms of the gray level, LBP, edge orientation and gradient amplitude features. We then measure the discrimination of each feature histogram and each bin from the distinction between object and background appearance, and assign each feature and each bin a weight accordingly. Finally, we obtain a novel feature combination that makes tracking more robust.

Figure 1. Weight image using the "center-surround" approach: (a) the inner rectangle represents the foreground and the surrounding ring the background, (b) weight image obtained by back-projecting the log-likelihood ratio between foreground and background, (c) the stable part of the foreground obtained by the GMM of [10], (d) histograms of object and background, (e) log-likelihood ratio of foreground and background, (f) distribution of the robust Gaussian component obtained by the GMM of [10].

II. THE PROPOSED METHOD
During tracking, if only one feature is used to represent the target, the contribution of each bin in the feature histogram varies and changes continually. To track the target robustly, the higher the contribution a bin provides, the higher the weight it should be assigned; we call these weights the interior weights of a feature. Moreover, an individual feature cannot guarantee the accuracy of the tracker, and when multiple features are used for tracking, the contribution of each feature also varies markedly. Each feature must therefore be assigned a different weight, so that discriminative features contribute more to tracking; we call these weights the exterior weights of features. The next two sections present how to compute the interior and exterior weights.

A. Interior Weight of Features
For an individual feature, bins that discriminate between foreground and background provide robust tracking performance. Assuming the distributions of gray level and gradient amplitude on the image are mixtures of Gaussians, we use a GMM to characterize these features, so their interior weights can be computed from the stable Gaussian component as in [10]. When extracting the LBP and edge orientation features, pixels from flat areas of the image are discarded, so their distributions cannot be modeled as Gaussian; we therefore compute their interior weights with the log-likelihood ratio function. The range of the log-likelihood ratio $L_u$ obtained by [9] is $[\log\delta, -\log\delta]$, which is mapped into $[0,1]$ by

$w_u = \dfrac{L_u - \min\{L\}}{\max\{L\} - \min\{L\}}$   (1)

where $\max\{L\}$ denotes the maximum of the log-likelihood ratio between foreground and background in the current frame and $\min\{L\}$ its minimum.
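The min-max normalization of equation (1) can be sketched as a small helper; the degenerate-case handling (all bins equal) is an assumption added for the sketch:

```python
import numpy as np

def interior_weights(L):
    """Map per-bin log-likelihood ratios L_u, which lie in
    [log(delta), -log(delta)], linearly into [0, 1] as in equation (1)."""
    lo, hi = L.min(), L.max()
    if hi == lo:  # degenerate frame: no bin is discriminative
        return np.full_like(L, 0.5, dtype=float)
    return (L - lo) / (hi - lo)
```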

B. Exterior Weight of Features
When tracking with combined multiple features, the contributions of the different features differ as the scene changes. A feature whose appearance on the target is similar to the background is unreliable and should be assigned a small weight; conversely, discriminative features should be assigned high weights.
We again use the "center-surround" approach to evaluate the distance between foreground and background. Three measures are commonly used to compute the distance between two histograms $p$ and $q$: the Bhattacharyya distance, the $l_2$ distance and the chi-square distance, given by

$d_B(p,q) = \sqrt{1 - \sum_u \sqrt{p_u q_u}}$   (2)

$d_{l_2}(p,q) = \sqrt{\sum_u (p_u - q_u)^2}$   (3)

$d_{\chi^2}(p,q) = \sum_u \dfrac{(p_u - q_u)^2}{p_u + q_u}$   (4)

Fig. 2 illustrates the gray-level-histogram distance between the object and its surrounding obtained from equations (2), (3) and (4) by sliding the (red) bounding box around the object vertically and horizontally. In the ideal case, the distance should be short when the bounding box overlaps the object heavily and long when the bounding box is far from the object. We therefore choose the Bhattacharyya distance to compute the exterior weight of features. Moreover, since scene changes also change the object's appearance, a feature of the target candidate that looks similar to that of the target template in the current frame is considered valid and reliable. Hence the final exterior weight of each feature combines its foreground-background distance with its similarity to the target template.
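The three histogram distances of equations (2)-(4) can be written compactly; this sketch assumes normalized histograms `p` and `q` of equal length:

```python
import numpy as np

def bhattacharyya(p, q):
    # equation (2): d_B = sqrt(1 - sum_u sqrt(p_u * q_u))
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q))))

def l2_distance(p, q):
    # equation (3): Euclidean distance between histogram vectors
    return np.sqrt(np.sum((p - q) ** 2))

def chi_square(p, q):
    # equation (4): chi-square distance, skipping empty bins
    s = p + q
    nz = s > 0
    return np.sum((p[nz] - q[nz]) ** 2 / s[nz])
```

All three vanish for identical histograms; the Bhattacharyya distance saturates at 1 for non-overlapping ones, which is the bounded behavior the paper exploits for the exterior weight.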

C. Histogram of Gradient Amplitude
Because individual features are unstable, gray level, LBP and edge orientation features are used in this paper to describe the color, texture and structure information of the target. When expressing the outline of the object, the edge orientation histogram discards the amplitudes of the edge points and thus cannot characterize the whole structure information of the target, so we put forward the gradient amplitude histogram. Two problems must be solved before computing the statistical distribution of the target's gradient amplitude in each frame: 1) The gradient amplitude of edges may change sharply once the illumination in the video changes.
2) Most pixels in a video frame lie in flat areas and carry no structure information; furthermore, there are so many of them that they ruin the statistical properties of the gradient amplitude histogram.
For the first problem, since the gradient amplitude of edge points is always higher than that of flat-area points, we take the maximum gradient amplitude in each frame as the upper bound and quantize the gradient amplitude into $l$ bins, so that pixels with high bin indices are close to edges. For the second problem, we set a gradient threshold to filter out the insignificant pixels. Denote $grads(x_i)$ the gradient amplitude of the pixel at $x_i$. The bin index of the $i$th pixel is computed as

$b(x_i) = \mathrm{floor}\!\left(l \cdot \dfrac{grads(x_i)}{\max\{grads\}}\right)$

where $\mathrm{floor}(.)$ is a rounding function and $\max\{grads\}$ is the maximum gradient amplitude in the image. In our experiments we set the gradient amplitude feature is invariant to 2D translation, scale transform and rotation. Combined with the edge orientation feature, it describes the whole structure information of the object.
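The two fixes above (per-frame upper bound, flat-area threshold) can be sketched as follows. The threshold value and the use of `np.gradient` for the image derivatives are illustrative assumptions, since the paper leaves the threshold unspecified:

```python
import numpy as np

def gradient_amplitude_hist(gray, l=32, thresh=10.0):
    """Gradient amplitude histogram: the per-frame maximum sets the upper
    bound (problem 1) and pixels below `thresh` are discarded (problem 2).
    `thresh` is a placeholder value; the paper does not state one."""
    gray = gray.astype(float)
    gy, gx = np.gradient(gray)
    grads = np.hypot(gx, gy)
    keep = grads > thresh              # drop flat-area pixels
    gmax = grads.max()
    if gmax == 0.0 or not keep.any():
        return np.zeros(l)
    # b(x_i) = floor(l * grads(x_i) / max{grads}), clipped into the last bin
    b = np.minimum(np.floor(l * grads[keep] / gmax).astype(int), l - 1)
    hist = np.bincount(b, minlength=l).astype(float)
    return hist / hist.sum()
```

Normalizing by the per-frame maximum makes the bin index insensitive to a global change of gradient magnitude, which is how the illumination problem is absorbed.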

D. Tracking
In conclusion, denote $q_G$, $q_T$, $q_O$ and $q_A$ the gray level, LBP, edge orientation and gradient amplitude histograms of the target respectively, and $p_G$, $p_T$, $p_O$ and $p_A$ the corresponding histograms of the target candidate. The combined similarity between candidate and target is

$\rho(y) = W_G \sum_{u=1}^{m_1} w_{G,u}\sqrt{p_{G,u}(y)\,q_{G,u}} + W_T \sum_{u=1}^{m_2} w_{T,u}\sqrt{p_{T,u}(y)\,q_{T,u}} + W_O \sum_{u=1}^{m_3} w_{O,u}\sqrt{p_{O,u}(y)\,q_{O,u}} + W_A \sum_{u=1}^{m_4} w_{A,u}\sqrt{p_{A,u}(y)\,q_{A,u}}$   (6)

where $W_G$, $W_T$, $W_O$ and $W_A$ are the exterior weights of the gray level, LBP, edge orientation and gradient amplitude features respectively, $w_G$, $w_T$, $w_O$ and $w_A$ the interior weights, and $m_1$, $m_2$, $m_3$ and $m_4$ the numbers of bins in the respective histograms.

Using a Taylor expansion of equation (6) around the histogram values at the previous location $y_0$, the linear approximation of $\rho(y)$ is obtained after some manipulations as

$\rho(y) \approx \frac{1}{2}\sum_{F \in \{G,T,O,A\}} W_F \sum_u w_{F,u}\sqrt{p_{F,u}(y_0)\,q_{F,u}} + \frac{1}{2}\sum_i \omega_i\, k\!\left(\left\|\frac{y - x_i}{h}\right\|^2\right)$   (7)

where $\omega_i$ is the weight of the pixel at location $x_i$, given by

$\omega_i = \sum_{F \in \{G,T,O,A\}} W_F \sum_u w_{F,u}\sqrt{\dfrac{q_{F,u}}{p_{F,u}(y_0)}}\;\delta[b_F(x_i) - u]$   (8)

Maximizing equation (7) by the mean shift procedure yields the new location

$y_1 = \dfrac{\sum_i x_i\, \omega_i\, g\!\left(\left\|\frac{y_0 - x_i}{h}\right\|^2\right)}{\sum_i \omega_i\, g\!\left(\left\|\frac{y_0 - x_i}{h}\right\|^2\right)}$

where $g(x) = -k'(x)$ and $k(x)$ is the kernel profile defined in [2].
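A single mean shift update with the pixel weights $\omega_i$ can be sketched as below, assuming the Epanechnikov profile for $k$ (so $g$ is constant on the kernel support, as in [2]); the function and variable names are illustrative:

```python
import numpy as np

def mean_shift_step(pixels, weights, y0, h):
    """One mean shift update:
    y1 = sum_i x_i w_i g(||(y0-x_i)/h||^2) / sum_i w_i g(.),
    with g = -k' for the Epanechnikov profile k, i.e. g == 1 inside
    the kernel support and 0 outside."""
    d2 = np.sum(((pixels - y0) / h) ** 2, axis=1)
    g = (d2 <= 1.0).astype(float)   # derivative of the Epanechnikov profile
    wg = weights * g
    if wg.sum() == 0.0:
        return y0                    # no supported pixels: stay put
    return (pixels * wg[:, None]).sum(axis=0) / wg.sum()
```

Iterating this step until $\|y_1 - y_0\|$ falls below a small tolerance gives the target's center in the next frame.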

III. EXPERIMENTS

A. Tracking Time Complexity
Our tracking algorithm can be summarized in three steps: computing the feature histograms, the histogram weights, and the target's center location in the next frame. Let $N$ be the average number of mean shift iterations per frame and $n_h$ the number of pixels covered by the target. The mean cost per frame of the proposed algorithm is

$C = N\,(c_H + c_W + n_h\,c_{MS})$

where $c_H$ is the cost of computing the histograms, $c_W$ the cost of the histogram weights, and $c_{MS}$ the cost of an addition, a square root and a division given by (3). In our experiments, the gray level histogram has 32 bins, the LBP histogram 5 bins, the edge orientation histogram 16 bins, and the gradient amplitude histogram 32 bins. The four feature histograms are not joined but combined by weights, so $c_H$ is the sum of the histogram costs, not their product. Because the histogram weights are computed on low-dimensional vectors, $c_W$ is also low. The mean cost of the traditional mean shift tracking algorithm [2] is

$C^* = N^*\,(c_H^* + n_h\,c_{MS}^*)$

where $N^*$ is the average number of iterations per frame of the traditional algorithm, $c_H^*$ the cost of its histogram, and $c_{MS}^*$ the cost of (3). The algorithm of [2] incurs no cost for feature selection, none for the LBP, edge orientation or gradient amplitude histograms of the object, and none for background histograms. Hence the cost of our algorithm is about twice that of [2], as confirmed by the experiments on two video sequences illustrated in Fig. 5 and Fig. 9.

B. Experiment Result and Analysis
Taking the two video sequences "woman" and "sylv" as examples, and using the gray level, LBP, edge orientation and gradient amplitude features, we compare the performance of three tracking algorithms: our algorithm, mean shift with interior weights of features but no exterior weights, and the traditional mean shift tracker with four features, as shown in Fig. 5 and Fig. 9 respectively.
As shown in Fig. 3, the method proposed in this paper has the most stable performance. We calculated the exterior weights of the four features in each frame, as illustrated in Fig. 4.

Figure 3. Tracking results on the video sequence "woman" for three methods: red rectangle for our method, green rectangle for mean shift tracking using interior weights of the four features, and blue rectangle for [2]. Frames 13, 121, 145, 199 and 265 are displayed.

As shown in Fig. 4, the gray level, edge orientation and gradient amplitude features of the target increase or decrease alternately as the scene changes, while the LBP feature is so affected by the background that its exterior weight stays small from beginning to end. Since three of the four features remain effective, none of the three methods loses the target, but our method is the most stable. Taking frame 292 of "woman" as an example, we back-project the exterior weights of the four features into weight images, shown in Fig. 5. From Fig. 5(e) and (h) we see that white points (with high weight values) concentrate on the "woman" object, which means that, for the gray level and gradient amplitude features, the GMM is able to find distinctive pixels on the target. As Fig. 5(f) and (g) show, the log-likelihood function filters out only a few background points for the LBP and edge orientation features, which therefore lack the ability to recognize the target. In frame 292, the gray level, LBP, edge orientation and gradient amplitude features are assigned exterior weights of 0.3953, 0.0898, 0.2266 and 0.2884 respectively. Because high exterior weights are assigned to the discriminative gray level, edge orientation and gradient amplitude features, and the interior weights separate object from background for the gray level and gradient amplitude features, our method tracks the target stably in this frame.
Comparing the tracking results with the ground truth, we see that, even though the center location error of each method is rather low, the error of the method proposed in this paper is still 2.39 pixels per frame less than that of [2], as shown in Fig. 6.

Figure 6. Center location error for the video sequence "woman".
We then examine the tracking results on the video sequence "sylv", shown in Fig. 7.

Figure 7. Tracking results on the video sequence "sylv" for three methods: red rectangle for our method, green rectangle for mean shift tracking using interior weights of features, and blue rectangle for [2]. Frames 470, 858, 920, 1055 and 1124 are displayed.
As illustrated in Fig. 7, our method tracks the target stably, while the other two fail on the last frames. The exterior weights of the four features, shown in Fig. 8, reveal that during tracking the exterior weight of the gray level feature is always higher than that of the others. Most of the time, gray level is the only feature that distinguishes the target well, while the other three are far less valid; therefore, tracking without exterior weights of features fails. Taking frame 1055 as an example, the weight images obtained by back-projecting the exterior weights of the four features are shown in Fig. 9. In frame 1055, the target is distinguished by the high-value weight points of all features except edge orientation. Here the exterior weights of gray level, LBP, edge orientation and gradient amplitude are 0.7781, 0.0163, 0.0944 and 0.1112 respectively, which means that, in this frame, only the gray level feature both discriminates between object and background and remains similar to the target template. In other words, the gray level feature contributes most to tracking. Comparing our tracking rectangle in "sylv" with the ground truth, the center location error is given in Fig. 10. As Fig. 10 shows, feature selection with interior and exterior weights runs throughout the tracking process and reduces the center location error by 15.1 pixels per frame.

IV. CONCLUSION
This paper presents a tracking algorithm using on-line selection of multiple features, which achieves stable tracking performance. We use both interior and exterior weights to improve the discrimination between object and background. Moreover, we use gray level, LBP and edge orientation features to describe the color, texture and structure of the object, and supplement the feature space with the gradient amplitude feature to complete the structure information, which makes tracking more robust.
The distributions of LBP and edge orientation in an image cannot be represented by a GMM, so our future work will focus on finding the discriminative parts of those features and on bringing more distinctive features into our tracking model.