On the Estimation of Markov Random Field Parameters

We examine the histogram method for estimating the parameters associated with a Markov random field. This method relies on the estimation of the local interaction sums from histogram data. We derive an estimator for these quantities that is optimal in a well-defined sense. Furthermore, we show that the final step of the histogram method, the solution of a least-squares problem, can be done substantially faster than one might expect if no equation culling is used. We also examine the use of weighted least-squares and see that this seems to lead to better estimates even with small amounts of data.


INTRODUCTION
E will consider 2D random fields defined over a finite rectangular N M lattice of pixels arranged in a toroidal topology. Given a random field Y, we denote a given realization by y and the numerical value of the random variable Y i,j associated with the i, jth pixel in this realization by y i,j . Each pixel in the image has an associated set of neighbors denoted by h i,j , which is a subset of other pixels from the image (usually pixels that are nearby the i, jth pixel in some sense). The random field Y is a Markov random field (MRF) if P(Y = y) > 0 for all y and further P(Y i,j = y | Y s,t = y s,t for all (s, t) ¡ (i, j)) = P(Y i,j = y | Y s,t = y s,t for all (s, t) ¶ h i,j ).
In other words, all realizations of the random field are possible and each pixel in the image is conditionally independent of all other pixels in the image given the values of its neighbors. Throughout this paper, we will focus on the binary autobinomial MRF with a uniform second-order neighborhood structure (see [2], [3] for details). For this model, a pixel may take on values of either zero or one with the conditional probability density that Y i,j = 1 given the values of its neighbors being given by  (1) can be rewritten in the following form: The vector v is sometimes called the local interaction vector, b is the vector of Markov random field parameters, and the value n is often called the local interaction sum.

ESTIMATING THE MARKOV RANDOM FIELD PARAMETERS
An important problem is to estimate the MRF parameter vector b from one or more realizations of the MRF. There are two common methods for accomplishing this task. The first, known as the coding method (see Besag [2]), is a maximum-likelihood approach that requires the solution of a large system of nonlinear equations. The second, which we shall consider here, was first proposed by Derin and Elliot [1] and is known as the histogram method. This approach relies on a couple of clever observations. First, that there are only a small number of possible neighborhood configurations and an even more limited number of possible local interaction vectors. In particular, for the case under consideration, there are only 2 8 = 256 possible neighborhood configurations. Moreover, since each of the last four elements in the local interaction vector can take on only the values zero, one, or two, there are only 3 4   that a pixel whose index is i will be a one. Once we have estimated that value, then we can recover the associated local interaction sum associated with that index with Finally, we should be able to find the MRF parameters by solving the system where In practice, one usually looks for a solution in the leastsquares sense, since there will be errors in estimating the local interaction sums. Moreover, it may not be possible to estimate all of the n i , in which case some rows of the equation may be dropped. It is clear that there are two steps to this algorithm: 1) estimating the local interaction sums and 2) solving the least-squares problem.
We will discuss each of these in more detail in the rest of this paper.

ESTIMATING THE LOCAL INTERACTION SUMS
The histogram algorithm proceeds by looking at the entire image (or some subset of its pixels) and building an indexed histogram. In particular, for each pixel in the sample set, determine the pixels index and then keep running tallies: 1) s i -the number of times we have observed a pixel of index i whose value is one and 2) t i -the number of times we have observed a pixel of index i whose value is zero.
Once we have built the histogram, we must estimate the corresponding local interaction sums, n i . We will now examine three possible approaches to this problem. Note that to simplify notation, in the following estimate derivations we will assume we are working with a fixed index i and will drop the subscript.

The Classical Estimate
The estimator proposed by Derin and Elliot [1] is based on first estimating the mean parameter, p, of the Bernoulli random variable. Assuming that we have s + t observations of the random variable and that s of those observations are ones and t are zeros, then the standard estimate of the mean parameter is And hence a logical estimate of n is This estimator is well-defined provided s, t > 0. Furthermore, if we consider the estimate $ s,t as a function of its subscripts, then it is clear that this function is real-valued and skew-symmetric, which is consistent with the nature of the quantity being estimated. More importantly, since the sample mean in (3) is known to be a consistent estimate of p, it can be shown that as s + t , the quantity $ s,t is itself a consistent estimate of n provided it is defined and p ¶ (0, 1).
Unfortunately, there are problems with this approach. Most troublesome is the fact that the estimate is not defined if either s = 0 or t = 0, which is quite likely if we are given a small number of total observations s + t and is always true if we have fewer than two observations. 1 Since we lose a row from (2) for each of these estimates that is not defined, this can lead to rank-deficiency and preclude solution of the least-squares problem. The beauty of this estimate is that it is given in closed form and is very easy to compute.

The Minimum Mean Square Bias Estimate
In [4], Gurelli and Onural propose a clever alternative to alleviate some of the problems associated with the standard estimate. Their estimate is chosen so that it minimizes the mean square bias of the estimate over a range of possible values of the mean parameter p. In particular, given n observations of the random variable, they define their estimate % k n k , − so that it minimizes This approach eliminates some of the problems associated with the $ s,t estimate, since % s,t can be defined when s = 0, t = 0, and when s + t 1. Unfortunately, these estimates are not available in closed form, although they need only be computed once, after which they can be retrieved from a table. It is proposed in [4] that the % s,t estimates be used when 5 s + t 11 and that the estimates $ s,t be used when s + t > 11. When s + t < 5, they suggest that the corresponding row be dropped from (2), although this is not necessary. Furthermore, it is suggested that one use p j ¶ {p | p = 0.5 ± (0.05 + 0.01k), k = 0, 1, 2, ..., 15} for N = 5, 6, ..., 10 and p j ¶ {p | p = 0.5 ± (0.05 + 0.01k), k = 0, 1, 2, ..., 25} for N = 11. This author's experiments indicate that this approach works fairly well. However, there are still difficulties because the $ s,t estimator does not provide estimates when s = 0 or t = 0 in cases where s + t > 11, which does occur reasonably often. It is possible to simply drop such rows from (2) but one might prefer to extend the % s,t estimate to cover these cases. Unfortunately, this does not work well because for large N, the least squares problem in (4) becomes so ill-1. Practical experience shows that it is not uncommon even when one has 100 or more observations. conditioned that it is not possible to solve it reliably, even if one uses the most stable known singular value decomposition approach. Moreover, before the ill-conditioning swamps the problem, the computed % s,t estimates can become inconsistent (e.g., one sometimes encounters negative estimates of % s,t when s > t).

A New Estimate
Recall that we are interested in developing an estimate $ ν for the true value of based on observations of the Bernoulli random variable. One common and useful criterion for developing such estimates is to consider the mean square error, defined to be and choose the estimate that minimizes this quantity. If $ ν satisfies this criterion, then it is called a minimum mean squared error (MMSE) estimator.
For the problem at hand, we are given n independent observations of a Bernoulli random variable with parameter p. We will let & k n k , − denote our estimate of n given that k of the n observations were ones. Since we have n observations, the mean square error of the estimate is given by Clearly, if we are given p, then the problem is easily solved. Since p is not known, we will use a Bayesian approach to eliminate it from the problem. In particular, assume that p is itself a random variable with known distribution and minimize the expected value of the quantity in (5) over all possible values of the parameter p. In particular, find values of & k n k , − for k = 0, 1, 2, ..., n that minimize where the expectation operation is with respect to p. Because p is the parameter for a Bernoulli random variable, it is convenient 2 for our Bayesian approach to assume that p is a uniform random variable on [0, 1], that is p ϳ U(0, 1). With these assumptions, (6) becomes n k p p p p dp which can be written as 2. Although certainly not necessary, as we can choose any distribution we like over the admissible interval [0, 1] or any appropriately measurable subset. To complete our derivation of the estimator, we need to evaluate the definite integrals in (11). We will consider the denominator first. Begin by recalling that the beta function (Euler's integral of the first kind) is defined by This function is symmetric, B(z, w) = B(w, z), and clearly p p dp B k n k k n k Moreover, it is known that the beta function is related to the gamma function by which, using the fact that G(n + 1) = n! at integer values, implies that p p dp k n k n k n k Now let us consider the definite integral in the numerator. We begin by noting that where the psi function (also called the digamma function) is defined to be It is known that at integer values of the argument, we can write where g is Euler's constant.
We are led to the following formula for the definite integral in the numerator: This estimate C s,t is a real-valued skew-symmetric function of its subscripts just like the classical estimate A s,t . Note that this formula is not suggested for actual computation.  where e is a vector of ones, and s is a vector of partial sums of the harmonic series, that is, s = [0 1 3/2 11/6 25/12 ... ] T .
One can also write C in terms of a Lyapunov transformation of the Hilbert matrix, where L is a matrix with ones below the main diagonal and zeros elsewhere and H is the Hilbert matrix (i.e. H i,j = 1 / (i + j -1)).

The Mean Square Error of the & s,t Estimate
Now that we have derived a simple formula for the estimate, we would like to compute the mean square error of the estimate given that we have n observations. Recall that the total mean square error from (7) is Note that by virtue of the binomial theorem, the last integral simplifies to Moreover, recall from (11) that p p p p dp p p dp Note that for values of n > 400, the following is a good rational approximation to the total mean square error:

Asymptotic Behavior of the & s,t Estimate
We now show that the & s,t estimate exhibits the appropriate asymptotic behavior as the number of observations approaches infinity. In particular, we will show that, in the limit, it behaves like the $ s,t estimate as the number of observations gets large. As a starting point, recall that the limit lim ln And we see that asymptotically, & s,t $ s,t as the number of observed ones (s) and the number of observed zeros (t) both approach infinity.

The Minimum Mean Square Bias Estimate Revisited
An obvious alternative method for deriving a minimum mean square bias estimate would be to assume that p ϳ U(0, 1), as was done for the MMSE estimate. In that case, we find that we want to choose the % k n k , − to minimize n k p p p p dp Evaluating the integrals (as in the previous section) and setting the gradient to zero yields the following system of linear equations for the estimates: Note that the Hessian of (15) is positive definite (it is just a positive diagonal scaling of the matrix A) and hence solving (16) yields a global minimum. It is of passing interest to note that the matrix A is symmetric, countersymmetric, positive-definite, and doubly stochastic.
The main trouble with the minimum mean square bias estimate is that it behaves inconsistently. This is first seen when computing the estimates for n = 3. In that case, (16)  which are not what we expect, since we get some positive estimates where we expect negative ones, and vice-versa. This behavior happens with all estimates for three or more trials, and we see that this estimator is inconsistent.

SOLVING THE LEAST SQUARES PROBLEM
Once we have obtained estimates for the local interaction sums, then we can set up and solve (2) in order to obtain our final estimate of the MRF parameters. In this section, we will examine this process in more detail and show some different approaches. Note that although we describe these methods in terms of the binary autobinomial MRF with a second-order neighborhood structure, they can be applied much more generally.
We begin by recalling that the local interaction vector for the i, j pixel is  .
Note that the first element is always one, and each of the last four elements may be either zero, one, or two, that is Given any one of the 81 possible vectors, we can use the following formula to determine a unique associated index: Similarly, it is possible to go from any integer i = 1, 2, ..., 81 to its corresponding local interaction vector by applying Horner's rule to extract the digits in the base three expansion of i -1, which become the elements of v i . Of course, many other orderings are possible but we will find this one convenient.

Rapid Solution of the Full System
One of the most serious deficiencies of the $ and % estimates is that they are not always defined. As a result, there are usually some indices i for which n i cannot be estimated and hence the corresponding row must be deleted from (2) before solving the least-squares problem; we shall call this process equation culling. In contrast, the & estimate always exists. This means that it is possible to reach the second stage and solve (2) without removing any equations. If we can avoid culling equations, then, for a given model, only the right-hand side of the least-squares problem will change for each different realization of the random field. This implies that we can simplify the solution process, since we can precompute the pseudo-inverse, (V T V) -1 V T , of the full V and then form our estimate with a simple matrix multiply. We note that, in general, it is numerically unwise to compute the pseudo-inverse or use it as a method of solving the least-squares problem. However, we give it analytically here and will show that it leads to a very elegant algorithm because of its special structure.
It is tedious, but not difficult to show that the pseudoinverse of V can be written explicitly in the following form: where S ¶ ª 581 is a matrix whose entries are described in the following way. The first 3 5-i entries in the ith row are ones, followed by 3 5-i zeros, followed by 3 5-i minus ones. This pattern repeats until the row is filled. In particular, the first row is all ones, the second is 27 ones, followed by 27 zeros, followed by 27 minus ones, and so on.
We note that the easiest way to derive this fact analytically is to compute the QR-factorization of V via the Gram-Schmidt process. Once we have QR = V, then the pseudoinverse is given by R -1 Q T . For the case under consideration, it is then easily observed that Q T is sparse and can be diagonally scaled to have entries that are either zero or ±1. Assimilation of this diagonal scaling into R -1 unnormalizes Q and leads to the matrix S just described. Given this simple form, we see that the estimates for the Markov parameters are easily calculated with the following formulas: It is very noteworthy that these formulas yield the final estimates for the Markov parameters with only 4 53 + 80 = 292 adds and five divides. 4 If intermediate sums are 4. We count additions and subtractions together as adds. Moreover, we note that the divides can be replaced by multiplies if 1/54 and 1/81 are computed in advance. This leads to a negligible time savings on most machines and induces one additional rounding error in each estimate, so there is no general advantage to computing the estimates in that way. stored, then one can compute the estimates with only 160 adds and five divides. This is a very substantial savings over the 5 81 = 405 multiplies and 5 80 = 400 adds that are required to multiply a general 5 81 matrix against a column vector.
Although it appears that this rapid algorithm for computing the least-squares solution is very specialized to the specific MRF under consideration, it is in fact possible to derive similar forms for other MRF models.

Using Generalized Least Squares
Because we are not given the true values of the local interaction sums, n i , we must use estimates, $ ν i . Assume that the true value and the estimate are related by ν ν where n i is the error of the estimate. Then (2) can be rewritten as where we assume that the vector n = [n 1 n 2 ... n 81 ] T is a noise vector with zero mean and known covariance W. A common approach is to solve the generalized least-squares problem where B is the Cholesky factor of W. For our purposes, we will assume that the n i are uncorrelated so that W will be diagonal. This leads to a weighted least-squares problem [5] where one can vary the contribution that each equation makes toward the final solution. A similar approach was used very successfully in [6], but their algorithm relies on heuristic methods for determining the weights. In contrast, we will assume that the variance of n i is equal to the mean squared error of the estimate $ ν i , which is given by (13).
Specific algorithms for solving the generalized least-squares problem can be found in [7] and references therein.

NUMERICAL EXPERIMENTS
In this section, we present the results of some simple numerical experiments. We note that all experiments were carried out using synthetic MRF textures that were generated with a fast Gibbs sampler algorithm. For each MRF texture, its corresponding histogram is generated over a single independent coding of the lattice. In particular, all textures are N N with N an even integer, and the histogram is generated by looking only at pixels with odd indices. This is in keeping with the restrictions on conditional independence, although it should be noted that if one bases the histogram on the entire texture, then the observed outcomes of the various experiments are substantially the same.
Of course, since we are considering algorithms for the estimation of the texture parameters of an MRF, we will present statistics on these in the tables that summarize each experiment. However, it is important to note that these numbers can be somewhat misleading. What really drives the process of texture formation is the conditional probabilities, and these are related to the texture parameters by an invertible but complex nonlinear transformation. This makes it difficult to establish a meaningful metric in the parameter space, since it is possible to have large variations in a texture parameter that lead to only small variations in the conditional probabilities. Therefore, it is also important to try to measure how well the texture parameters generated by the different algorithms reproduce the correct conditional probabilities. Given a set of texture parameters β and an estimated set $ β , we will consider the following measure: which is the squared distance from the true conditional probabilities to those given by the estimated texture parameters.
In the first set of experiments, we compare three estimation procedures. We note that for all three, we discard any equations for which fewer than five pixels with the corresponding index were observed in the histogram. The three procedures are: 1) The algorithm proposed by Derin and Elliot [1]. This uses the $ estimates, and hence we must also discard all equations for which the estimated marginals are zero or one.
2) The algorithm proposed by Gurelli and Onural [4]. This uses the % estimates for any index with fewer than 12 observations, and the $ estimates otherwise. All equations with 12 or more observations for which the estimated marginals are zero or one must be discarded. 3) In this algorithm, we use the & estimates, so that no additional equations need to be discarded.
The results of this experiment appear in Table 1. Because the three methods discard different amounts of information, we also keep track of N, the number of admissible equations that appears in the least squares problem. It is interesting to see just how few equations are used, on average, to estimate the parameters. In fact, for the 30 30 regions, we find that Method I averages only eight equations and in fact fails roughly 3% of the time because of rank deficiency.
The second experiment looks at using the & estimates and then solving the full unculled least squares problem using the fast algorithm. These results appear in Table 2. In this experiment, we note that although the texture parameter estimates are not as close as in the first experiment, they do exhibit smaller standard deviation.
Moreover, the z error measure appears to be uniformly better, which is our first indication that there may be useful information in the fact that certain indexes are not observed.
The third experiment again uses the & estimates but then considers solution via weighted least squares. For practical purposes, the mean square error values for n = 0, 1, 2, ..., 999 were separately computed and stored in a look-up table, for n > 999, the rational approximation from (14) is used. We consider two variations of this approach: one using the full (or unculled) set of equations and the other using only those equations for which there are observations. These results appear in Table 3. We note that both approaches are comparable: the first having smaller z errors and the second giving slightly better average estimates of the texture parameters. Again it appears that equation culling leads to larger z errors but lower bias in the texture parameter estimates. An excellent analysis of the use of weighted least-squares methods for estimating texture parameters can be found in [6]. They also present results indicating that weighted least squares compares quite favorably with maximum-likelihood methods.

CONCLUSIONS
We have examined the histogram method for estimating the parameters of a Markov random field. It is seen that two of the major steps in implementing this algorithm are to estimate the local interaction sums and to solve a least squares problem. We derived a new estimator for the local interaction sums that was shown to be optimal in the minimum mean squared error sense. Moreover, we saw that this estimate has the proper asymptotic behavior as the number of observations gets large. Finally, unlike the other estimates in common use, this estimate is defined for all possible collections of observations.  ALL INDEXES AND THEN  SOLVING THE FULL SYSTEM (NO EQUATION CULLING) WITH THE FAST ALGORITHM Based on 1,000 trials for each region size. For this experiment, the textures were generated using parameters b 0 = -4 and b 1 = b 2 = b 3 = b 4 = 1.

TABLE 3 USING WEIGHTED LEAST SQUARES WITH THE & ESTIMATES
The unculled version uses the full set of equations, the culled version removes all equations for which there are no observations. Based on 1,000 trials for each region size. For this experiment, the textures were generated using parameters b 0 = -4 and b 1 = b 2 = b 3 = b 4 = 1.
Second, we showed that the least-squares problem can be solved extremely rapidly if no equation culling was required after estimation of the local interaction sums. We also showed how to apply generalized least squares methods to this problem. It was seen that the generalized least squares approach gave excellent estimates even if the size of the sample realization was small.