Parallel WiSARD Object Tracker: a RAM-based tracking system

This paper proposes the Parallel WiSARD Object Tracker (PWOT), a new object tracker based on the WiSARD weightless neural network that is robust against quantization errors. Object tracking in video is an important and challenging task in many applications. Difficulties can arise from weather conditions, target trajectory and appearance, occlusions, lighting conditions and noise. Tracking is a high-level application that requires the object location frame by frame in real time. This paper proposes a fast hybrid image segmentation method (threshold and edge detection) in the YCbCr color model and a parallel RAM-based discriminator that improves efficiency when quantization errors occur. The original WiSARD training algorithm was modified to allow tracking.


INTRODUCTION
Tracking surface targets (ships or enemy vessels) is a task of great importance for warships. Advances in electronics and the emergence of new technologies have enabled the design and implementation of modern weapons. Such weapons can be supported by a tracking system that calculates the best direction for a shot.
Video tracking is a tool that replaces or assists radar tracking systems. Many factors hinder radar tracking, including clutter, sectors not covered by radar, electronic countermeasure systems and high cost. This paper presents an efficient video target tracker based on the WiSARD weightless neural network, which is able to work in real time and can compensate for quantization errors generated by the image segmentation. Quantization in this paper means the conversion of pixel values to a binary representation. The WiSARD neural model is based on RAM-based neurons; various RAM node sizes, in networks working individually and in parallel, were tested together with image segmentation methods and color models. Each target is searched for only in a small region called the region of interest (ROI) in order to reduce processing time.
Tracking is generally a challenging problem [1] and has been widely studied. It is used for surveillance [2], target tracking, human gesture recognition, street traffic monitoring, and even to help a robot perform inertial navigation. Trackers have been developed based on background removal [3], feature points [4], feed-forward neural networks [5], fuzzy logic [6] and other methods. Trackers based on weightless neural networks (WNN), however, are not widespread, if they exist at all. This article introduces PWOT, an object tracking system based on the WiSARD WNN.
The complexity of tracking sea surface targets in videos comes from several factors, such as climate and sea conditions, target speed and appearance, and the presence of non-hostile vessels. The proposed Parallel WiSARD Object Tracker (PWOT) [7] (figure 1) has three components: an object detector ObjDet based on the WiSARD WNN (first RAM memory), a second RAM memory (RAM2) and a position predictor. At each frame, ObjDet returns the target position and writes it into RAM2. The WiSARD has a group of discriminators, each acting as a classifier that recognizes a different class of bit pattern. At the first frame, all discriminators are trained with the quantized pixels (target model) inside a frame region defined manually by the operator, the selection window (SLW). At the following frames, ObjDet receives as input the quantization result of all pixels inside the ROI. Each discriminator tries to recognize the target in a different region inside the ROI. By comparing the discriminator responses, ObjDet determines the target position. The position predictor estimates the object position in the next frame, and the ROI center is moved to this position at the next frame. PWOT was developed in C++ and Qt. The videos were run on an HP laptop with 298 MB of RAM, a 1.14 GB hard disk, an AMD Turion processor and a Linux operating system. We used an outdated computer to test the effectiveness of this approach on a slow device.

THE WEIGHTLESS NEURAL NETWORK WISARD
WNNs are an important branch of neural network research. The neuron inputs and outputs are sets of bits, and there are no weights between neurons. The activation function of each neuron is stored in look-up tables that can be implemented as RAM memories [8]. Training a conventional artificial neural network (ANN) involves adjusting weights. In contrast, WNN training is carried out by modifying the words stored in the look-up tables, which allows the construction of flexible and fast training algorithms. With look-up tables, any activation function can be implemented at the nodes, since any binary value can be stored in response to any set of input bits during training. The fast training speed is due to the mutual independence between nodes when a new input is presented. In an ANN, by comparison, training changes the weight values, which also modifies the network behavior for previously trained patterns.
The WiSARD neural network [9] is a WNN whose neurons are implemented as RAM memories. It bypasses the inability of the perceptron to implement the XOR function and can be trained in a very short time, which is essential for tracking.
The WiSARD was designed for pattern recognition but can also be used for other purposes, such as target tracking (this paper), automatic movement control of an offshore platform [10], pattern recognition [11], minimization of the neuron saturation problem [12], development of surveillance systems [2], inertial robot navigation [13] and diagnosis of neuromuscular disorders [14]. Because its nodes map directly onto RAM memories, the WiSARD is easy to implement in hardware, and hence the PWOT hardware implementation is straightforward.
The WiSARD is built by grouping a set of basic elements called discriminators. Each discriminator is a set of k RAM memory nodes (figure 2), each addressed by N bits (N-input RAM node), designed to recognize a different class of bit pattern. Each RAM stores 2^N words of one bit [8]. In an image containing k.N pixels, each quantized pixel represents one bit. k sets of N randomly chosen bits are connected to the k RAM address buses. The discriminator response is the unweighted sum of the k RAM words accessed. The binding of the k.N bits to the address buses is called the input mapping. Once set, the input mapping remains constant during the training and classification phases.
Suppose that a WiSARD discriminator has k RAM nodes. Before the training phase, the bit "0" is written at all RAM addresses. A training set X of class A examples, each k.N bits in size, is prepared. During the training phase, the examples from X are placed at the discriminator input one by one, and the bit "1" is written at every RAM address accessed in the discriminator being trained. Another training set Y with class B examples is used to train another discriminator. In the classification phase, a test vector is placed at the WiSARD input and is assigned to the class represented by the discriminator that returns the greatest response. In character recognition, for example, each discriminator recognizes a different letter and is trained with quantized images containing only the letter it is designed to recognize.
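The training and classification phases described above can be illustrated with a minimal sketch. It is not the PWOT implementation (which was written in C++); the class name Discriminator, the function classify, the seed parameter and the 12-bit toy patterns are assumptions made only for illustration.

```python
import random


class Discriminator:
    """One WiSARD discriminator: k RAM nodes, each addressed by n bits."""

    def __init__(self, input_size, n, seed=0):
        if input_size % n != 0:
            raise ValueError("input_size must be a multiple of n")
        self.n = n
        self.k = input_size // n
        # Input mapping: a fixed random permutation of the input bit positions.
        self.mapping = list(range(input_size))
        random.Random(seed).shuffle(self.mapping)
        # Each RAM holds 2**n one-bit words, all initialised to 0.
        self.rams = [[0] * (2 ** n) for _ in range(self.k)]

    def _addresses(self, bits):
        """Yield (node, address) for every RAM node under the input mapping."""
        for node in range(self.k):
            address = 0
            for i in self.mapping[node * self.n:(node + 1) * self.n]:
                address = (address << 1) | bits[i]
            yield node, address

    def train(self, bits):
        """Write 1 at every RAM address accessed by this example."""
        for node, address in self._addresses(bits):
            self.rams[node][address] = 1

    def response(self, bits):
        """Unweighted sum of the k accessed RAM words."""
        return sum(self.rams[node][a] for node, a in self._addresses(bits))


def classify(discriminators, bits):
    """Return the label of the discriminator with the greatest response."""
    return max(discriminators, key=lambda label: discriminators[label].response(bits))


# Toy example: two classes of 12-bit patterns, 3-input RAM nodes (k = 4).
pattern_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
pattern_b = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
net = {"A": Discriminator(12, 3, seed=1), "B": Discriminator(12, 3, seed=1)}
net["A"].train(pattern_a)
net["B"].train(pattern_b)

noisy = pattern_a[:]
noisy[0] ^= 1  # one wrongly quantized pixel
print(classify(net, noisy), net["A"].response(noisy))
```

A single flipped bit changes the address of exactly one RAM node, so the response of the trained discriminator drops by one and the pattern is still assigned to class A.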
The proposed tracker PWOT was tested on sea surface targets. At each video frame, pixels are quantized and placed at the input of ObjDet, the first PWOT component, which is based on the WiSARD. At each frame, ObjDet tries to find the target inside the ROI. Because PWOT must return the target position rather than a class label, the original WiSARD training procedure was modified: all discriminators are trained with the same target model (the quantized SLW pixels), but each discriminator tries to recognize the target in a different region (figure 3). The search region covered by each discriminator is spatially displaced from the others. The discriminator that returns the highest response with a confidence level C (relative difference from the second highest response) greater than a minimum value reveals the target position.
The search region of one discriminator has the same size as the SLW. The union of all discriminator search regions forms the ROI. The search region of each discriminator overlaps those of its neighbors (figure 4). The algorithm steps are:
Step 1 (first frame): The operator manually selects the target in the first video frame to set the SLW [15].
Step 2 (first frame): Convert the pixels inside the SLW to a binary representation (pixel quantization).
Step 3 (first frame): Train the WiSARD (each ObjDet discriminator) with the quantized SLW pixels.
Step 4 (following frames): Track the target (trained bit pattern) with PWOT. Repeat step 4 at the next frame.
The ROI geometric center is initially located at the SLW geometric center. The position predictor estimates the target position at the next frame using a simple Kalman filter. Then, PWOT moves the ROI geometric center to the estimated position (step 4).
Step 4 in detail: ObjDet is a set of discriminators implemented as RAM memories (RAM1). At each frame, the ROI pixels are quantized before PWOT acts. The quantized pixels inside the search region of each discriminator are placed at its input as a binary test vector. All discriminators try to recognize the target inside their search regions. The discriminator that returns the highest response with a minimum degree of confidence C reveals the target position (figure 3). ObjDet writes the target position into a FIFO queue (RAM2). Only the last N target positions are saved. The position predictor reads these N positions to estimate the next one. The ROI geometric center is moved to the estimated position, so that ObjDet searches for the target where it will probably be in the next frame. When PWOT perceives that the responses of two discriminators are similar (low value of C), the WiSARD configuration must be changed online.
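Step 4 can be sketched as follows, under several stated assumptions. Since all discriminators are trained with the same target model, the sketch assumes they also share one input mapping, so a single set of trained RAM contents is reused for every displaced search region. The confidence level is taken as C = (best - second) / best, which is one possible reading of "relative difference from the second highest response". The predictor is a constant-velocity extrapolation standing in for the Kalman filter used by PWOT. All function names and the toy 10x10 frame are hypothetical.

```python
import random
from collections import deque


def addresses(bits, mapping, n):
    """Yield (node, address) pairs for a flat bit list under a fixed input mapping."""
    for node in range(len(mapping) // n):
        address = 0
        for i in mapping[node * n:(node + 1) * n]:
            address = (address << 1) | bits[i]
        yield node, address


def window_bits(image, top, left, h, w):
    """Flatten the h x w window whose upper-left corner is (top, left)."""
    return [image[r][c] for r in range(top, top + h) for c in range(left, left + w)]


def locate(image, trained, mapping, n, roi_top, roi_left, h, w, grid, step, c_min):
    """Evaluate grid x grid displaced search regions; return (position, C)."""
    scores = []
    for gy in range(grid):
        for gx in range(grid):
            top, left = roi_top + gy * step, roi_left + gx * step
            bits = window_bits(image, top, left, h, w)
            r = sum((node, a) in trained for node, a in addresses(bits, mapping, n))
            scores.append((r, (top, left)))
    scores.sort(reverse=True)
    (best, pos), (second, _) = scores[0], scores[1]
    confidence = (best - second) / best if best else 0.0
    # Position is reported only when the confidence level reaches the minimum.
    return (pos if confidence >= c_min else None), confidence


def predict(history):
    """Constant-velocity extrapolation (a stand-in for the Kalman predictor)."""
    if len(history) < 2:
        return history[-1]
    (y0, x0), (y1, x1) = history[-2], history[-1]
    return (2 * y1 - y0, 2 * x1 - x0)


# Toy quantized frame: a 4x4 target of ones at row 3, column 4 on a zero background.
H = W = 4
image = [[0] * 10 for _ in range(10)]
for r in range(3, 7):
    for c in range(4, 8):
        image[r][c] = 1

mapping = list(range(H * W))
random.Random(0).shuffle(mapping)
model = [1] * (H * W)                          # quantized SLW pixels (target model)
trained = set(addresses(model, mapping, 4))    # RAM addresses holding the bit 1

position, confidence = locate(image, trained, mapping, 4, 1, 2, H, W,
                              grid=5, step=1, c_min=0.2)
history = deque(maxlen=4)                      # RAM2: keeps only the last N positions
history.append((2, 2))                         # hypothetical position from the previous frame
history.append(position)
next_center = predict(history)
print(position, confidence, next_center)
```

Storing only the set of trained (node, address) pairs is equivalent to a sparse RAM and keeps the sketch short; a hardware or C++ implementation would more likely use dense bit arrays.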
This algorithm can be made automatic by replacing the first step with a motion detector and a classifier. The classifier should decide whether the object inside a moving region is a vessel and return the target bounding box (SLW). Fast algorithms for detecting moving objects, such as background subtraction [3] and frame differencing [4], and for classifying objects based on WiSARD [14], ANN, SVM or Bayesian networks [1], are widely used.

PWOT
This section describes the simulations performed to improve the efficiency of the Parallel WiSARD Object Tracker.

Video characteristics
All videos have a resolution of 640x480 pixels (figure 4). Each scene contains only one surface target, which performs unrealistic and very difficult maneuvers.

Image Segmentation
The novel hybrid image segmentation method, used in all simulations up to the discriminator setting GP6 (section 3.3.2), receives RGB images as input. The method segments the image by sampling, thresholding and edge detection: Step 1-The SLW center cs selected by the operator belongs to the target. Four points on the target border are found by sliding the Prewitt edge detector in the transverse and longitudinal directions, starting from cs (figure 5). The pixels pf in the transverse (p1-p2) and longitudinal (p3-p4) bands are good samples for calculating the thresholds because ships usually have little tonal variation. To minimize its influence on the simulations, the initial SLW was kept constant for each video. Sampling provides the rapid image segmentation that is essential for tracking; the standard segmentation methods are computationally more expensive.
Step 2-Calculate the mean med_pf and standard deviation s_pf of the pixels pf in the R, G and B channels.
Step 3-Sample the pixels pw belonging to the sea and calculate the mean m_sea in the R, G and B channels.
Step 4-Calculate six thresholds in the first frame using (1), one pair for each RGB channel. Each pair defines a threshold rule that includes the target pixels and excludes background pixels. The values of x, y and z are chosen to best separate the pf pixels from the pw pixels by thresholding.
Step 5-Quantize the ROI pixels at each frame after the first using the six calculated thresholds. This segmentation algorithm performs better than the following methods: threshold [16], edge detection [17], threshold with edge detection [18], region growing [16], watershed transformation [19] and split and merge [16].

Experiments
All discriminators initially have 9-input RAM nodes (sections 3.3.1 and 3.3.2). Other RAM sizes were tested next. The discriminator centers are displaced from each other by xp pixels along the x axis and yp pixels along the y axis. The geometric centers of the discriminator search regions form a grid of points GP. A tracking error is counted when the position indicated by the tracker has an intersection/union ratio of less than 50% with the correct bounding box.
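The tracking error criterion can be written as a short function. This is a minimal sketch assuming axis-aligned boxes given as (left, top, width, height); the names iou and is_tracking_error are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection/union ratio of two boxes given as (left, top, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def is_tracking_error(predicted, truth, min_ratio=0.5):
    """True when the tracker output overlaps the correct box by less than 50%."""
    return iou(predicted, truth) < min_ratio


truth = (100, 80, 60, 40)
print(is_tracking_error((110, 80, 60, 40), truth))  # shifted by 10 pixels
print(is_tracking_error((140, 80, 60, 40), truth))  # shifted by 40 pixels
```

With a 60x40 box, a horizontal shift of 10 pixels gives a ratio of 2000/2800 (no error), while a shift of 40 pixels gives 800/4000 (error).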

Setting the WiSARD discriminator
The S1 simulation (table 2) aims to choose the optimal number of discriminators. The greater the number, the larger the ROI and the lower the probability of losing the target, but the greater the mean execution time ET of the tracking algorithm per frame. However, in the VF video the tracker lost the target 3 frames earlier with GP3 than with GP1 and GP2 because of quantization errors. In VS, the target is lost in the third frame with GP1, GP2 and GP3: the target direction changes instantaneously and the target leaves the ROI. With GP4, no target loss occurs because the search region is larger.
With a more accurate quantization, the tracker fails less often. In VF, VC, VUW1 and VUW2, GP4 performs worse than GP1, GP2 and GP3 with respect to the frame at which the first error occurs, despite the larger ROI, because of SLW pixel quantization errors (wrong target model). With GP5, the performance is inferior to GP4 in VF and VC for the same reason: the position predictor indicates a dubious region some frames earlier. The introduction of a position predictor based on the Kalman filter improves the tracker performance on most videos. However, in VS the tracker fails earlier with GP5 than with GP4 because the unpredictable target movement misleads the predictor. Considering ET and the frame at which the first error occurs, the GP5 setting was the most efficient, which shows the importance of including a position predictor in PWOT.

Improving the proposed hybrid image binarization method
Several image segmentation methods and color models were tested. The combination of the YCbCr color model with a hybrid segmentation method achieved the best results and is used in all settings from GP7 onward. The improved image segmentation algorithm has two steps: Step 1-Obtain the pixels pf in the same way as in step 1 of section 3.2; transform them from the RGB to the YCbCr model and calculate the mean and standard deviation of the pixels pf in the Y, Cb and Cr channels; Step 2-Calculate six thresholds using (2). For all videos, x=3, y=3 and z=1.5 are sufficient to eliminate tracking errors caused by inefficient quantization (table 3). In the frames after the first, the ROI pixels are transformed to the YCbCr model and quantized with these thresholds.
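The quantization step can be sketched as below. Two points are assumptions, not taken from the text: the threshold pairs are assumed to have the form mean ± factor·standard deviation, with x, y and z applied to the Y, Cb and Cr channels respectively, and the RGB to YCbCr conversion is assumed to be the full-range BT.601 (JFIF) variant. The sample pixel values and function names are hypothetical.

```python
from statistics import mean, pstdev


def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF) conversion; the variant used by PWOT is not stated."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def thresholds(samples_rgb, factors=(3.0, 3.0, 1.5)):
    """Six thresholds (one pair per channel), assumed to be mean +/- factor * std."""
    channels = list(zip(*(rgb_to_ycbcr(*p) for p in samples_rgb)))
    return [(mean(c) - f * pstdev(c), mean(c) + f * pstdev(c))
            for c, f in zip(channels, factors)]


def quantize(pixel_rgb, limits):
    """Return 1 if the pixel falls inside all three threshold pairs, else 0."""
    values = rgb_to_ycbcr(*pixel_rgb)
    return int(all(lo <= v <= hi for v, (lo, hi) in zip(values, limits)))


# Hypothetical target samples pf taken in the first frame.
pf = [(100, 110, 120), (110, 105, 125), (90, 115, 115)]
limits = thresholds(pf)

print(quantize((100, 110, 120), limits))  # target-like pixel
print(quantize((30, 60, 120), limits))    # sea-like pixel
```

The thresholds are computed once in the first frame and then reused to quantize the ROI pixels of every following frame.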
The introduction of a second discriminator layer, GPD, denser and smaller than GP, improved the tracking in VF, VT and VS. In VUW2, however, quantization errors make the WiSARD respond more strongly to a sea surface region 9 frames earlier with the GP6 setting than with GP5. VUW1, VUW2, VB1 and VB2 are very challenging because the target-sea contrast is low, producing more quantization errors. With its more efficient quantization method, the GP7 setting tracks the target correctly until the end of all videos, which demonstrates both the efficiency of this method for tracking sea surface targets and the influence of quantization on tracking performance.

Optimal RAM node size
Seeking to lower ET, the GP7 setting was used to check which RAM node size reduces ET while maintaining the same efficiency. The RAM node address size was varied from 1 bit up to the largest size allowed by the computer memory (figure 6). The shape of the graph is the same in all tests; only the node size at which ET starts to increase sharply changes (from 16 to 22 bits). ET increases once the computer RAM limit is exceeded.
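A rough count of stored words helps to see why memory becomes the limit for large nodes. For a discriminator covering a fixed number of input bits, the number of nodes falls linearly with the node size N while each node stores 2^N words, so the total grows roughly exponentially. The sketch below assumes a hypothetical SLW of 2400 pixels and counts one-bit words only; the actual memory footprint depends on how each word is stored.

```python
def ram_words(input_bits, n):
    """Return (number of nodes, total one-bit words) for n-input RAM nodes."""
    nodes = input_bits // n          # leftover bits that do not fill a node are ignored
    return nodes, nodes * 2 ** n


slw_bits = 2400  # hypothetical SLW of 60 x 40 quantized pixels
for n in (3, 9, 15, 20):
    nodes, words = ram_words(slw_bits, n)
    print(f"N={n:2d}: {nodes:4d} nodes, {words:,} words per discriminator")
```

Under these assumptions a single discriminator needs 6,400 words with 3-input nodes but over 125 million with 20-input nodes, and the total must be multiplied by the number of discriminators in the grid.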
To study how the RAM node size can compensate for quantization errors and improve the tracker performance, the quantization quality was deliberately degraded to various degrees by modifying the values of the six thresholds. All supported sizes were tested. Considering only ET, nodes with address bus sizes between 2 and 14 bits have similar performance, but with 3 bits the tracker compensates better for the quantization errors, failing less often in about 70% of the cases. We conclude that the node size influences the robustness against quantization errors.

Quantity of Discriminators
VF, VC and VUW1 were used to verify whether there is a relationship between the density/number of discriminators and robustness against quantization errors. In these videos the quantization quality was deliberately worsened. Four discriminator settings were tested. VF and VUW1 were chosen because their quantization degrades with a smaller threshold variation. VC was chosen because there is a cruise ship on the horizon similar to the target. The tested trackers used 3-input RAM nodes (3-bit address bus) because they performed best in the previous simulations.
Notation for the GP5, GP6 and GP7 results: T5, T6 and T7: ET (ms) for tracking with the GP5, GP6 and GP7 settings respectively; M5, M6 and M7: number of the first frame at which the tracker fails with the GP5, GP6 and GP7 settings respectively. GP5 has the same settings as in simulation S1 (table 2). GP6 and GP7 have two discriminator layers: the first, GP, has 20 x 20 discriminators with sxp=5 and syp=5, and the second, GPD, denser than GP, has 10 x 10 discriminators with sxp=2 and syp=2. Both layers are centered on the target position provided by the predictor. GP5 and GP6 use the segmentation method presented in section 3.2 and GP7 uses the improved segmentation method (section 3.3.2).
From table 4, we conclude that there is a relationship between the discriminator distribution and robustness against quantization errors. Comparing GP8 with GP9 and GP10 with GP11, it is clear that the tracker followed the target for more frames using discriminators spaced 1 pixel apart in x and y than 2 pixels apart, although the former settings use fewer discriminators (25%). GP11 was the most efficient, surpassing the others in 8 of 9 tests.

Tracker with two parallel WiSARD neural networks
The WiSARD best recognizes a trained pattern (greater response) when the pixels PO belonging to the object are quantized with one value and the pixels PNO that do not belong to the object are quantized with the other. If pixels are wrongly quantized, the network responds with a smaller sum. In most segmentation methods, the probability of a PNO pixel being wrongly quantized is greater than that of a PO pixel, because the main aim of these methods is to include in a region RG the PO pixels that meet their requirements RQ. The exclusion of PNO pixels from RG is merely a consequence of their not satisfying RQ, and excluding a PO pixel from RG is considered worse than adding a PNO pixel to RG. We conclude that quantization errors are more likely for PNO pixels than for PO pixels. The quantization proposed in this paper includes PO pixels in RG if their color belongs to an interval defined by the thresholds L1 to L6. Excluding PNO pixels from RG is not the priority; it occurs only if the PNO pixels are sufficiently different from PO pixels.
Designing a network in which PO pixels have more weight in the response yields more hits. The template matching algorithm of some kernel-based object trackers [20] associates weights with the SLW pixels, with lower weights for pixels near the SLW border. The weighting increases the robustness of the matching, since the peripheral pixels are the least reliable, being often affected by occlusions or clutter. See [20] for the mathematical details.
The WiSARD, by definition, has no weights. To circumvent this problem, two parallel networks with different RAM node sizes in their discriminators can be used. Each discriminator response of the parallel network is the sum of the responses of two discriminators working alone (3). The two discriminators cover disjoint regions, and the union of both regions forms the region covered by the parallel discriminator. A discriminator with 3-input RAM nodes can generate a higher response than a discriminator with 15-input RAM nodes because, for the same number of input pixels, it has five times as many nodes. The first discriminator, which has more nodes (3-input RAM nodes), receives as input the quantized pixels inside the central part of the SLW (probably where the target is), and the other discriminator, which has fewer nodes (15-input RAM nodes), receives as input the quantized pixels inside the peripheral part of the SLW (probably where background pixels are) (figure 7). The word Parallel in PWOT comes from this way of placing two discriminators in parallel to improve the WiSARD performance when quantization errors occur.
By varying the quantity of wrongly quantized pixels, 129 simulations were performed to find the percentage P of SLW pixels covered by the discriminator with 3-input RAM nodes (pixels inside the blue region), and the percentage (1-P) covered by the other one with 15-input RAM nodes (pixels inside the pumpkin region), that increases robustness against quantization errors (figure 7). For these tests we used the GP11 discriminator setting because it is the most effective against quantization errors (section 3.3.4).
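The parallel discriminator can be sketched as the sum of two sub-discriminator responses over disjoint pixel sets. This is an illustration under assumptions: the central region is taken as a centered rectangle covering about a fraction P of the window, pixels that do not fill a complete node are ignored, and RAM contents are stored sparsely as sets. The names split_regions, SubDiscriminator and ParallelDiscriminator and the 12x12 toy window are hypothetical.

```python
import random


def split_regions(h, w, p):
    """Return (central, peripheral) pixel indices; central covers about p of the window."""
    scale = p ** 0.5
    ch, cw = max(1, round(h * scale)), max(1, round(w * scale))
    top, left = (h - ch) // 2, (w - cw) // 2
    central, peripheral = [], []
    for r in range(h):
        for c in range(w):
            inside = top <= r < top + ch and left <= c < left + cw
            (central if inside else peripheral).append(r * w + c)
    return central, peripheral


class SubDiscriminator:
    """Discriminator with n-input RAM nodes over a given subset of pixels."""

    def __init__(self, pixels, n, seed=0):
        order = pixels[:]
        random.Random(seed).shuffle(order)  # fixed random input mapping
        # Group mapped pixels into n-input nodes; leftover pixels are ignored.
        self.nodes = [order[i:i + n] for i in range(0, len(order) - n + 1, n)]
        self.trained = set()                # sparse RAM: addresses that hold 1

    def _keys(self, bits):
        for node, pix in enumerate(self.nodes):
            yield node, tuple(bits[i] for i in pix)

    def train(self, bits):
        self.trained.update(self._keys(bits))

    def response(self, bits):
        return sum(key in self.trained for key in self._keys(bits))


class ParallelDiscriminator:
    """Response is the sum of a central (3-input) and a peripheral (15-input) part."""

    def __init__(self, h, w, p, n_central=3, n_peripheral=15):
        central, peripheral = split_regions(h, w, p)
        self.parts = [SubDiscriminator(central, n_central, seed=1),
                      SubDiscriminator(peripheral, n_peripheral, seed=2)]

    def train(self, bits):
        for part in self.parts:
            part.train(bits)

    def response(self, bits):
        return sum(part.response(bits) for part in self.parts)


# Toy 12x12 SLW with P = 25%: target pixels (1) in the center, background (0) around.
h = w = 12
central, peripheral = split_regions(h, w, 0.25)
model = [0] * (h * w)
for i in central:
    model[i] = 1

disc = ParallelDiscriminator(h, w, 0.25)
disc.train(model)

bad_periphery = [b ^ 1 if i in set(peripheral) else b for i, b in enumerate(model)]
bad_center = [b ^ 1 if i in set(central) else b for i, b in enumerate(model)]
print(disc.response(model), disc.response(bad_periphery), disc.response(bad_center))
```

In this toy case the 36 central pixels feed 12 nodes and the 108 peripheral pixels feed only 7, so corrupting the whole periphery lowers the response less than corrupting the whole center, even though the periphery contains three times as many pixels.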
The simulations showed that there is always a pixel percentage P (or percentage range) covered by the discriminator with 3-input RAM nodes that improves the performance when quantization errors occur. Using the optimal value of P, two WiSARDs working in parallel are always more efficient than one network with 3-input or 15-input RAM nodes working alone. The optimal value of P is not the same for all videos: in VF, P is between 40% and 45%; in VC, between 80% and 85%; in VUW1, between 17% and 22%. The optimal value depends on factors such as target and SLW sizes and target-background contrast. P can be recalculated online when the confidence level C decreases or when the predictor perceives a failure.
Notation for table 4: MB: relative percentage rise of wrongly quantized pixels; M8, M9, M10 and M11: first frame at which a failure occurred using the GP8, GP9, GP10 and GP11 settings respectively. GP8 comprises GP and GPD1. GP9 comprises GP and GPD2. GP10 comprises GP, GPD3, GPD4, GPD5 and GPD6. GP11 comprises GP, GPD7, GPD8, GPD9 and GPD10. GP has 20 x 20 discriminators with sxp=5 and syp=5. GPD1, GPD3, GPD4, GPD5 and GPD6 have 10 x 10 discriminators with sxp=2 and syp=2. GPD2, GPD7, GPD8, GPD9 and GPD10 have 5 x 5 discriminators with sxp=1 and syp=1. The layers GP, GPD1, GPD2, GPD3 and GPD7 are centered on the target position provided by the predictor. GPD4 and GPD8 are centered at the position of the greatest discriminator response in GP; GPD5 and GPD9 are centered at the position of the second largest discriminator response in GP; GPD6 and GPD10 are centered at the position of the third largest discriminator response in GP.
Figure 7: The proposed parallel WiSARD discriminator has RAM nodes of two different sizes. Its response is the sum of two discriminator responses: the discriminator with more nodes receives as input the quantized pixels inside the central part of the SLW (blue region), and the one with fewer nodes receives as input the quantized pixels inside the peripheral part of the SLW (pumpkin region). The green rectangle represents the SLW and the red one represents the ROI.

CONCLUSIONS
The results obtained in the simulations were better than expected. The computer used has little processing power, yet PWOT runs in real time, showing that the WiSARD WNN can be used for tracking. If fewer RAM neurons and the best quantization (section 3.3.2) were used, the mean tracking time (ET) could be reduced by a factor of about 10 or more without tracking failures.
The simulations in section 3.3.1 show that the greater the number of WiSARD discriminators, the greater the ET, the larger the ROI and the lower the probability of losing a fast target. When fewer quantization errors occur, the tracker fails less often. The introduction of a Kalman position estimator improves performance without considerably increasing ET, and it allows the tracker to detect a tracking failure during the target detection phase.
A hybrid and innovative quantization method is described in section 3.3.2. When the target and background pixels are correctly quantized and the number of discriminators around the position returned by the position estimator is increased, the tracker follows the target for more frames without failure.
The tests performed in section 3.3.3 evaluated the changes in tracking performance when nodes with different address bus sizes are used. The tests showed that address bus sizes between 2 and 14 bits have similar ET and are more efficient than other sizes, although 3-input RAM nodes compensate for quantization errors more efficiently. An important conclusion is that the RAM node address bus size influences the robustness against quantization errors.
There is a relationship between the discriminator positions and robustness against quantization errors (section 3.3.4). Using fewer discriminators with less spacing in pixels between them allows the target to be tracked without failure for more frames.
An innovative way of using two WiSARDs in parallel to increase the robustness against quantization errors was presented in section 3.3.5. The parallelism occurs at the discriminator level (figure 7). Tracking with two parallel WiSARDs is more efficient than using one because it can track the target for more frames when quantization errors occur. Thus, the parallel WiSARD can compensate for the errors generated by the segmenter. The simulations showed that there is a percentage P of pixels covered by the WiSARD with 3-input RAM nodes, and a percentage 1-P of pixels covered by the WiSARD with 15-input RAM nodes, that improves the PWOT performance. P is not the same for all videos. The optimum value of P depends on factors such as target size, SLW size and target contrast against the background. P can be recalculated online when the confidence level C falls below a threshold or when the predictor points to a position where the target cannot be.
To improve the PWOT performance, future work can investigate new ways to distribute the discriminators, other kinds of predictors, new ways to organize two parallel neural networks, other kinds of features to train the network, methods to reduce the number of features and the use of more than two WiSARDs in parallel.