Implementing the minimum-misclassification-error energy function for target recognition

The authors demonstrate through an example that the minimum-misclassification-error (MME) classifier can dramatically outperform the sigmoid-least-mean-squares (σ-LMS) classifier. Three energy functions that are useful for classification goals other than simply minimizing the misclassification rate are proposed. First is a minimum-cost function, which allows different costs for misclassifications from different classes. Second is a Neyman-Pearson function, which minimizes the number of misclassifications for one class given a fixed misclassification rate for the other class. Last is a minimax function, which minimizes the maximum number of misclassifications when the a priori probabilities of each class are unknown. Unlike their classical classifier counterparts, these energy functions operate directly on a training set and do not require that class probability distributions be known.


Introduction
For automatic target recognition or other types of pattern recognition, we want a classifier that minimizes the probability of misclassifying test data. When the network weights are computed, the distributions of test data are known only through the training set. Therefore, given a training set that adequately represents the underlying class distributions, we want a network that minimizes the number of misclassified training samples. It has been shown [1] that LMS (either σ-LMS as normally used in backpropagation [2] or linear-LMS as used in the Widrow-Hoff algorithm [3]) does not produce a minimum-misclassification-error (MME) solution. Minimizing the misclassification error is vital in military applications, since a misclassification can produce a mission failure (for cruise missiles, ship defense, etc.).
We have proposed [4] an MME neural network to minimize the misclassification error. Others have also recognized that this is a desirable goal [5-7]. We demonstrate the MME energy function together with three new extensions: a minimum-cost classifier (which allows different costs for misclassifications from different classes), a Neyman-Pearson classifier (which minimizes the misclassification for one class for a given misclassification rate for the other class), and a minimax classifier (which minimizes the maximum misclassification rate possible when the a priori probabilities are unknown). The namesakes of these classifiers [8] require that the class distributions be known, but these new classifiers operate directly on a training set.
The neural network implementation of these classifiers is more powerful than the classical paradigm because the class density functions do not need to be known a priori. These neural network implementations also do not require that the densities be estimated from the training set before determining class boundaries; the neural network computes optimal class boundaries directly from the training set. Also, a neural network implementation allows massively parallel computing for real-time learning and recall.
Section 2 formulates all of the classifiers. Section 3 details the MME energy function and provides examples. The three new classifiers are demonstrated in Section 4. Section 5 provides comments on implementation and Section 6 a conclusion.

Formulation
Following notation from [8,9], let ω_i denote the i-th class, P(ω_i) the a priori probability of that class, and p(z|ω_i) the probability density for a 1-D feature measurement z from ω_i. In a two-class problem, there are two types of errors: those in which samples from ω_1 are misclassified and those in which samples from ω_2 are misclassified. These can be written as

ε_1 = ∫_{Ω_2} p(z|ω_1) dz,   ε_2 = ∫_{Ω_1} p(z|ω_2) dz, (1)

where Ω_i is the region in which z is classified as ω_i. Then the total error is

E = P(ω_1)ε_1 + P(ω_2)ε_2. (2)

U.S. Government work not protected by U.S. copyright.
For the case where we do not know the exact distributions and have only a training set, we approximate ε_1, ε_2, and E as

ε̂_1 = (# misclassified class 1 training vectors) / N_1 (3)
ε̂_2 = (# misclassified class 2 training vectors) / N_2 (4)
Ê = (N_1 ε̂_1 + N_2 ε̂_2) / N, (5)

where N_1 and N_2 are the number of training vectors from ω_1 and ω_2 and N = N_1 + N_2. In the infinite-sample case, Ê converges to E. Eqs. 3-5 are a general definition of our MME energy function, for which a specific equation is given in Section 3. Minimizing Ê minimizes the number of misclassified training vectors.
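The training-set approximations of Eqs. 3-5 are simple counts. A minimal sketch, assuming the linear sgn(w^T x) classifier introduced later in the paper; the weight vector and toy data here are illustrative, not from the paper's experiments:

```python
import numpy as np

def error_rates(w, X1, X2):
    """Training-set error estimates in the spirit of Eqs. 3-5.
    X1: class-1 vectors (desired output +1), X2: class-2 vectors (-1)."""
    mis1 = np.sum(X1 @ w < 0)    # class-1 vectors classified as class 2
    mis2 = np.sum(X2 @ w >= 0)   # class-2 vectors classified as class 1
    N1, N2 = len(X1), len(X2)
    eps1_hat = mis1 / N1                                  # Eq. 3
    eps2_hat = mis2 / N2                                  # Eq. 4
    E_hat = (N1 * eps1_hat + N2 * eps2_hat) / (N1 + N2)   # Eq. 5
    return eps1_hat, eps2_hat, E_hat

# toy 1-D feature with an appended bias component; boundary at z = 0.5
w = np.array([1.0, -0.5])
X1 = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])   # class 1
X2 = np.array([[-1.0, 1.0], [0.2, 1.0]])              # class 2
print(error_rates(w, X1, X2))   # (0.333..., 0.0, 0.2)
```

Minimizing Ê over w is then exactly minimizing the number of misclassified training vectors.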
Of course, minimizing the probability of error is not the only possible goal for a classifier. If the costs of misclassifying each class are different (say c_1 and c_2 for ω_1 and ω_2), then the goal is a minimum-cost classifier, which should minimize

c_1 ε̂_1 + c_2 ε̂_2. (6)
A Neyman-Pearson classifier minimizes ε̂_1 for a fixed ε̂_2 = ε_0 (i.e., a fixed false alarm rate). We can write an energy function for this as

ε̂_1 + c(ε̂_2 − ε_0)², (7)

where c is a weighting coefficient. Note that the second term becomes zero when ε̂_2 = ε_0. Both Eqs. 6 and 7 can specify various points along the Receiver Operating Characteristic (ROC) curve, but in different ways. Eq. 6 is more desirable when the misclassifications have different costs, and Eq. 7 is more desirable when a false alarm rate is specified.
The misclassification probability is a function of the a priori probabilities. In cases where these are not known or can vary, it is useful to minimize the maximum error that can occur for any set of a priori probabilities. This is called a minimax classifier. A derivation [8] shows that for the minimax classifier, ε_1 = ε_2. An energy function to minimize this is given by

ε̂_1 + c(ε̂_2 − ε̂_1)². (8)

Note that the second term becomes zero when ε̂_1 = ε̂_2. A straightforward extension (which we do not test in this paper) of the minimax classifier incorporates different misclassification costs for each class.
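Given ε̂_1 and ε̂_2 estimated from a training set, the three proposed energy functions (Eqs. 6-8) are scalar combinations of the two rates. A hedged sketch; the function names are our own, and the default weighting c = 200 is the value the simulations in this paper use:

```python
def min_cost_energy(eps1_hat, eps2_hat, c1, c2):
    # Eq. 6: different misclassification costs per class
    return c1 * eps1_hat + c2 * eps2_hat

def neyman_pearson_energy(eps1_hat, eps2_hat, eps0, c=200.0):
    # Eq. 7: minimize eps1 with eps2 held near eps0 by a soft
    # constraint; the penalty term vanishes when eps2_hat == eps0
    return eps1_hat + c * (eps2_hat - eps0) ** 2

def minimax_energy(eps1_hat, eps2_hat, c=200.0):
    # Eq. 8: drive eps1 == eps2 so the worst-case error over
    # unknown a priori probabilities is minimized
    return eps1_hat + c * (eps2_hat - eps1_hat) ** 2

print(neyman_pearson_energy(0.1, 0.3, 0.3))   # penalty term is zero -> 0.1
print(minimax_energy(0.2, 0.2))               # eps1 == eps2 -> 0.2
```

Any of these can be substituted for Ê as the quantity minimized during training; only the scalar objective changes, not the network.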
These types of classifiers already exist when the class densities are known, just as the Bayes classifier is the minimum-error classifier when the class densities are known. Our contribution is to extend all of these to the case of classifiers that are synthesized from a training set by minimizing an energy function.

Minimum-Misclassification-Error Energy Function
In order to define and demonstrate the MME energy function, we first introduce our notation. Let w be an L × 1 weight vector that connects an input vector x with a single output sgn(w^T x), where the superscript T denotes transposition and sgn(·) denotes the signum function (sgn(z) = 1 if z ≥ 0; sgn(z) = −1 otherwise). For a two-class problem, a positive output indicates one class and a negative output indicates the other. A training set with N training vectors is denoted as x^n, n = 1...N. Although we are demonstrating the MME energy function with a very simple classifier, the extension of the energy function to multilayer feedforward networks is straightforward.

The commonly used σ-LMS energy function is given for this simple classifier by

E_σ-LMS = Σ_{n=1}^{N} (d^n − σ(w^T x^n))², (9)

where d^n is the n-th desired output and σ is a sigmoidal function. A more natural energy function for pattern recognition simply counts the number of training vectors that the network misclassifies. This is given by Eqs. 3-5, with E_MME = Ê and the misclassification counts in Eqs. 3 and 4 computed as

# misclassified class i training vectors = N_i − Σ_{n: x^n ∈ ω_i} step(d^n w^T x^n), (10)

where d^n is now the desired output sign (±1) and step(z) = 1 if z ≥ 0; step(z) = 0 otherwise. When the desired output sign is the same as that of the actual output w^T x^n, x^n is correctly classified, the step function evaluates to 1, and the count of misclassifications is reduced by one. When the desired and actual output signs differ, x^n is misclassified, the step function evaluates to 0, and the count is not reduced. It is interesting to note that Eq. 10 is an L1 norm while Eq. 9 is an L2 norm (just as the city-block distance is an L1 norm while the Euclidean distance is an L2 norm).
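The exact (non-differentiable) count of Eq. 10 is compact in code: for d^n ∈ {+1, −1}, step(d^n w^T x^n) is 1 exactly when x^n is correctly classified. A small sketch with illustrative toy data:

```python
import numpy as np

def step(z):
    # step(z) = 1 if z >= 0, else 0
    return (z >= 0).astype(float)

def misclassification_count(w, X, d):
    """Number of misclassified training vectors (Eq. 10 form).
    X: N x L matrix of training vectors, d: desired signs (+1/-1)."""
    correct = step(d * (X @ w))   # 1 for each correctly classified vector
    return len(X) - correct.sum()

X = np.array([[1.0, 1.0], [-1.0, 1.0], [0.5, 1.0]])
d = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])   # boundary at z = 0
print(misclassification_count(w, X, d))   # 1.0
```

Restricting the sum to one class's vectors and dividing by N_i gives ε̂_i of Eqs. 3-4 directly.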
To perform gradient descent requires that the energy function be differentiable. Since a step function is not differentiable, we approximate it with a sigmoid function of variable steepness. As the sigmoid is steepened, it better approximates the step function, and the energy function better approximates E_MME. This sigmoid is given as

σ_r(z) = 1 / (1 + e^{−z/r}), (11)

where r is the steepening parameter. Note that σ_1(z) is the standard sigmoid function used in backpropagation, and that σ_0(z) (the limit as r → 0) is the step function in Eq. 10. Thus, the energy function we minimize with gradient descent is

E_MME of Eqs. 3-5, with σ_r in place of the step function in Eq. 10. (12)

To demonstrate the MME and σ-LMS classifiers, we use the training set shown in Figure 1 and a test set (not shown), each set having 100 training vectors per class. We adopt a technique described elsewhere [10] so that w defines a hyperspherical (circular in 2-D) boundary. A quasi-Newton method (BFGS) [11] is used to minimize E_σ-LMS and E_MME.
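The smoothing of Eqs. 11-12 can be sketched end-to-end: replace step with σ_r, minimize with BFGS, and steepen r between minimizations. The data, the r schedule, and the plain linear boundary below are illustrative assumptions (the paper's experiment uses a hyperspherical boundary [10] and its own data sets):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit   # numerically stable sigmoid

def smoothed_mme(w, X, d, r):
    # Eq. 12: fraction misclassified with step replaced by
    # sigma_r(z) = expit(z / r) of Eq. 11 (equal class sizes assumed)
    correct = expit(d * (X @ w) / r)
    return 1.0 - correct.mean()

rng = np.random.default_rng(0)
X1 = rng.normal([2.0, 2.0], 1.0, (100, 2))   # class 1, desired +1
X2 = rng.normal([0.0, 0.0], 1.0, (100, 2))   # class 2, desired -1
X = np.hstack([np.vstack([X1, X2]), np.ones((200, 1))])  # append bias
d = np.concatenate([np.ones(100), -np.ones(100)])

w = np.zeros(3)
for r in (1.0, 0.3, 0.1):                    # steepen the sigmoid
    w = minimize(lambda w: smoothed_mme(w, X, d, r), w, method="BFGS").x

hard_error = np.mean(np.sign(X @ w) != d)    # true misclassification rate
print(hard_error)
```

Warm-starting each steeper stage from the previous solution keeps the nearly flat gradients of a steep sigmoid from stalling the optimization.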

Simulations
We demonstrate the minimum-cost, Neyman-Pearson, and minimax classifiers on a two-class, two-feature problem. For all classifiers except the minimax, P(ω_1) = P(ω_2) = 0.5. Each class is drawn from a Gaussian probability density with a diagonal covariance matrix. The ω_1 (called target) density has mean (2, 2) and standard deviations (1, 2), while ω_2 (called clutter) has mean (0, 0) and standard deviations (2, 1). One thousand training and one thousand test samples were generated for each class. Figure 2 shows the training set with a boundary computed using E_MME. The E_MME and E_σ-LMS classifiers misclassified 16.9% and 18.2% of the test set, respectively. The 1.3% difference is small, but is consistent with other test results (including Figure 1) where E_MME outperforms E_σ-LMS.
A ROC plots the probability of detection (PD), 1 − ε̂_1, vs. the probability of false alarm (PFA), ε̂_2. The MME classifier produces one point on the ROC curve, but other points may be more desirable. The minimum-cost and Neyman-Pearson classifiers can generate these other points, and the ROC curves they trace are essentially identical. Either method may be preferable depending on whether the application is more easily characterized by different misclassification costs or by a desired false alarm rate. Note that for the Neyman-Pearson classifier, the plotted points occur at regular intervals that match the false alarm rates (e.g., 0.30, 0.35, 0.40, 0.50) specified in synthesis. Both methods are preferable to the suboptimal varying-threshold method. This is especially apparent above the "knee" of the curves, where the two methods perform up to 5.5% better than the suboptimal method. At the knee, all methods perform identically. This is because the operating point for the MME classifier is at the knee (PFA = 0.117 and PD = 0.753), and the other methods cannot perform any better at that or nearby points. Below the knee, the minimum-cost and Neyman-Pearson ROCs appear virtually identical to the MME ROC, but this is deceptive because of the very steep slopes of the curves. In fact, the minimum-cost and Neyman-Pearson classifiers outperform the MME below the knee by up to 6.5%-8% for a given false alarm rate. Specifically, for a false alarm rate of 0.007, for which all three classifiers generated an operating point, the test-set classification rate is 0.386, 0.451, and 0.465 for the MME, minimum-cost, and Neyman-Pearson classifiers, respectively.
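ROC points can be generated by re-synthesizing the Neyman-Pearson classifier (Eq. 7) at several target false-alarm rates ε_0. A sketch under the same illustrative assumptions as before (plain linear boundary, synthetic unit-variance Gaussians, our own r schedule, not the paper's experiment):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
X1 = rng.normal([2.0, 2.0], 1.0, (200, 2))       # targets
X2 = rng.normal([0.0, 0.0], 1.0, (200, 2))       # clutter
A1 = np.hstack([X1, np.ones((200, 1))])          # append bias
A2 = np.hstack([X2, np.ones((200, 1))])

def np_energy(w, r, eps0, c=200.0):
    # Eq. 7 built from the smoothed counts of Eqs. 10-11
    eps1 = 1.0 - expit((A1 @ w) / r).mean()      # missed targets
    eps2 = expit((A2 @ w) / r).mean()            # false alarms
    return eps1 + c * (eps2 - eps0) ** 2

roc = []
for eps0 in (0.05, 0.10, 0.20, 0.30):
    w = np.zeros(3)
    for r in (1.0, 0.3, 0.1):                    # steepen per Eq. 11
        w = minimize(np_energy, w, args=(r, eps0), method="BFGS").x
    pd = np.mean(A1 @ w >= 0)                    # probability of detection
    pfa = np.mean(A2 @ w >= 0)                   # probability of false alarm
    roc.append((pfa, pd))
print(roc)
```

Each ε_0 yields one (PFA, PD) operating point, which is why the Neyman-Pearson points fall at the regular intervals specified in synthesis.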
For the training and test sets defined above, the minimax classifier (with c = 200 in Eq. 8) yields test-set misclassification rates of ε̂_1 = 0.182 and ε̂_2 = 0.183. Thus, the energy function operates as expected. For comparison, varying the threshold on the MME classifier so that ε̂_1 = ε̂_2 for the training set yields ε̂_1 = 0.185 and ε̂_2 = 0.182 on the test set. Thus, by chance, the suboptimal approach produced essentially identical results. In general, the minimax classifier should perform as well as or better than the suboptimal approach of varying the threshold of the MME classifier.

Implementation Comments
The feedforward operation (i.e., recall or testing) for the classifiers described in this paper is identical to that for the σ-LMS classifier, so identical hardware can be used. The learning phase obviously differs. Figure 4 gives block diagrams for implementing E_σ-LMS and E_MME in batch mode. Note that E_σ-LMS (Eq. 9) involves a subtraction of the desired output, while E_MME (Eq. 12) involves σ_r, r → 0 (Eq. 11) and a multiplication by the desired sign, denoted by an encircled d^n.

Conclusion
We have demonstrated through a simple example that the MME classifier can dramatically outperform the σ-LMS classifier. This is true because E_σ-LMS does not minimize the training-set misclassification rate (as others have also shown), while E_MME does. Building on E_MME, we have proposed three new energy functions that are useful for classification goals other than simply minimizing the misclassification rate. The minimum-cost energy function allows different classes to have different misclassification costs. The Neyman-Pearson energy function allows the misclassification rate for one class to be minimized when the misclassification rate of the other class is fixed. The minimax energy function allows the worst-case misclassification rate to be minimized when the class a priori probabilities are unknown or can vary. These new energy functions were demonstrated through a simple example and shown to perform as desired and to outperform the suboptimal procedure of computing the MME classifier and varying its threshold. The network we used for demonstrating the concept was very simple (a single weight vector producing a hyperspherical boundary), but it is straightforward to apply these new energy functions to a multilayer feedforward network. Thus, one of these new energy functions, selected according to the particular classification goal, should be employed rather than E_σ-LMS, as long as the training set adequately represents the test set and the best performance possible is required.