Which is the Better Entropy Expression for Speech Processing :-S iog S ’ or log S ?

The aliasing that results from time-sequential sampling of spatiotemporal signals is strongly dependent on the order in which the spatial points are sampled. To design sampling patterns that reduce aliasing, the sequence of sampling points is mapped into several shorter subsequences via the Chinese remainder theorem. A painvise exchange algorithm then finds the best ordering of each subsequence. The patterns obtained with this procedure perform substantially better than those known previously, and perform as well as the optimal patterns that can be expressed in closed form when the signal is termporally undersampled by less than a factor of 2. M


I. INTRODUCTION
ANY spectral analysis techniques start with measured values of the autocorrelation function R ( t ) of a signal at a set of points. One class of techniques proceeds by extrapolating R(t) to reasonable values at the unknown points. The extrapolated autocorrelation function is equivalent to a power spectrum estimate, since the power spectrum S(f) of a bandlimited stationary process is related to its autocorrelation function by a Fourier transform.
Perhaps' the best known extrapolation technique is Burg's maximum entropy spectral analysis (MESA) [ 11 , [2] , in which the power spectrum S(f) is estimated by maximizing subject to the constraints R, = R(t,.) = 1 : S(nexp(2nitrndf (2) where W is the bandwidth and where R(t,), r = 1, 2, . . , M , are known values of the autocorrelation function. The MESA estimate of S(f) has the well-known all-pole, autoregressive, or linear prediction form, which can also be derived by various equivalent formulations [3] - [6]. It has become one of the Manuscript received February 24, 1983;revised July 29, 1983. The authors are with the Computer Science and Systems Branch, Information Technology Division, Naval Research Laboratory, Washington, DC 20375. most widely used spectral analysis techniques in geophysical data processing [7] -[9] and speech processing [4] , [lo] .
"Maximum entropy spectral analysis" is also used in image processing. In that field, howeve'r, the phrase refers not only to successful estimates produced by maxfiizing ( 1) [ 191 , the success of (3) in image processing raises the question of whether (3) might also be useful in speech processing. We consider the question in this paper, and we attempt to answer it. As part of our investigation, we also derive a generalization of the estimate produced by maximizing (3), one that takes into account a prior estimate of the unknown power spectrum.
Our'paper is organized as follows. In Section I1 we review derivations of the forms (1) and (3), and we discuss theoretical arguments for each of them. We then turn to an empirical comparison. Our approach is discussed in Section I11 and the results are summarized in Section IV. A general discussion then follows in Section V.

BACKGROUND
In this section we give brief derivations of the spectral estimators that result from maximizing (1) and (3). We show that both estimators 'result from applying the principle of minimum cross-entropy [20] - [24] , a generalized form of the principle of maximum entropy [25] -. [27]. However, they differ concerning the quantities that are treated as random variables. In the case of (l), the underlying random variables are the coefficients of a Fourier series model and the spectral powers S(f) are expected values. In the case of (3), the spectral power S(f)-suitably normalized-is treated as a probability density and the underlying random variable is the frequency.
Cross-entropy minimization estimates an unknown probability density qt(x) from a prior estimate p ( x ) and known expected values Jq t (x> dx = Fr (4) U.S. Government work not protected by U.S. copyright (r = 0, . . , M). The estimate is obtained by minimizing the cross-entropy subject to the constraints (4) and The resulting estimate of q t (x) has the form [20] , [22] , [28] where the p, and h are Lagrangian multipliers determined by (4) and (6). Cross-entropy minimization reduces to entropy maximization when the prior p(x) is uniform.

A . The -log S Form
In deriving MESA, Burg's approach was to extrapolate R(t) in a manner that maximizes the entropy of the underlying stochastic process [I] , [2] . This is an application of the principle of maximum entropy [24] - [27] . The expression (1) is the entropy gain in a stochastic process that is passed through a linear filter that converts white noise to a process with power spectral density S ( f ) (see [9, pp. 412-4141, [29, pp. 93-95], [30, p. 2431 ). This suggests that the process entropy can be maximized by maximizing (1)  For example, entropy is mathematically ill-behaved for continuous densities [32, pp. 31-32].
A derivation of MESA without these drawbacks arises as a special case of minimum cross-entropy spectral analysis (MCESA) [31] , and also helps to expose the difference underlying the choice of maximizing Like MESA, MCESA is an information-theoretic extrapolation of R(t), but it differs from MESA in that it accounts for a prior estimate of S(f) [or R(t)] . In deriving MCESA, we lose no generality by considering time-domain signals of the form (1) or (3).
where ak and bk are random variables and where the& are frequencies [31], [33, p. 361. Since the power at frequencyfk is X k (ai t b i ) , we describe the random process by a joint probability density qt (x), where x = x1 , x2, . . . , X N .
The spectral power at frequencyfk ofqt(x) is the expectation Let Pk be a prior estimate of S , . Then it is appropriate to assume N 1 p(x)= -exp k = l pk as form for the prior estimate of the probability density q t [3 11 .
Suppose that one obtains new information about q t in the form of M t 1 values of the autocorrelation function R (t,), where r = 0, . . * , M and to = 0. Using (10) one can write this in the form of expected value constraints (4) on q t (x). Given the prior (1 1) and these constraints, one can compute a minimum cross-entropy posterior estimate q(x) of the form (8). The corresponding posterior estimate of the power spectrum is just Sk = J dr Xkq(x), which becomes where the & are chosen so that the Sk satisfy the autocorrelation constraints (12) (with s k in place of s,$) [31] . If one assumes a flat prior estimate of the prior spectrum, Pk = P, and equal spacing of the autocorrelation lags, tr = rat, (13) can be written in the form (8) [31] .
The posterior probability density can be expressed in terms of the posterior spectral power estimates (1 3) Computing the normalized differential entropy of the posterior power estimates (1 3) yields Except for the constant, which has no effect on maximization, the right-hand side of (1 5 ) is the discrete form of (1). Maximizing (1 5 ) subject to the autocorrelation constraints leads again to (8).

B. The -S log S Form
In this case we treat the unknown power spectrum variables S,$ as probabilities, which is mathematically reasonable provided that the power spectrum is normalized so that x,$,$ = 1.
The known autocorrelations are then expressed as expectations of the probability distribution Ski, k = 1, . . . , N , as in (1 2).
In deriving the resulting power spectrum estimate, we again proceed with the general case involving a prior estimate and cross-entropy minimization. Since we assumed a known autocorrelation for lag to = 0, &$ = k R o is known. Let pk be a prior estimate of Skf , and let q t = {sl, t t 4 2 , . . * , 4$ } andp = ( p l , p z , . . . , pN } be probability distributions defined by normalizing the power spectra S2 and Pk, i.e., q,$ = (2S,$)/R0 and Pk = Pk/T, where T = &Pk. We rewrite the autocorrelation constraints (12) as expectations of qP: and we obtain a posterior estimate 4 of q t by minimizing the cross-entropy H(q, p ) subject to the constraints (16) (with 4k in place of 42). Note that the constraint for r = 0 reduces to the normalization constraint &4k = 1. The result is where the p,. are chosen to satisfy the constraints. We define the posterior power spectrum estimate as Sk = y.oqk, which satisfies (12). This yields 1 where the g. are e ual to the pr in (17) except for cpo , which satisfies cpo = po t y10g(R0/2T). Since 9 N s k 1%sk --1 hROH(q,p) %RO l o g z RO k = l pk it follows that minimizing H ( q , p ) is equivalent to minimizing so that minimizing (19) subject to the constraints (12) yields (18). For a flat prior estimate Pk = P , minimizing (19) is equivalent to maximizing which is just the discrete form of (3).

C. Discussion
Both estimates of S,$ proceed from a prior estimate Pk and known autocorrelations R,.. When the coefficients in an underlying Fourier series model are treated as random variables and the S i are treated as expectations, cross-entropy minimization leads to the estimate (13). For the case of a flat prior estimate P, = P, (1 3) follows from maximizing &log&. When the S$ are treated as probabilities rather than expectations, crossentropy minimization leads to the estimate (18), which also follows from maximizing -XkSklogSk in the case of a flat prior. Because the result in this case arises from performing maximum entropy on a probability distribution defined by normalizing a power spectrum, we refer to it as maximum en-tropy normalized spectral analysis (MENSA).' The Lagrangian multipliers P, in (1 3) and qr in (1 8)  Which of the two estimates (13) and (1 8) is better? In our opinion, if one has a good physical model for some variable of interest, and if the model can be incorporated into the derivation of an estimate for that variable, it makes sense to do so.
Because such estimates can exploit more information than estimates derived without an underlying model, estimates based on underlying models should be better. Since (13) exploits an underlying Fourier series model-well known to be a useful model for time series-this point favors (13). Also, since (13) yields all-pole models in the important case of flat priors, since all-pole spectra result from passing a broad-band signal through a multilayered transmission medium, and since the human vocal tract is a multilayered transmission medium, it follows that (13) should be appropriate for speech processing.
On the other hand, arguments for (18) also have merit. For example, in arguing for the maximization of -&&logsk rather than &log&, Skilling [ 161 points out that the goal is to estimate the power spectrum itself, not the Fourier amplitudes in an underlying model like (9), so that a more direct and better estimate should result from treating the unknown power spectrum variables $ as probabilities. Mathematically, this is reasonable provided that the power spectrum is normalized so that &!$ = 1. Furthermore, speech spectra are known to have occasional zeros, and the form of (18) shows that small values for s k can arise from moderate values of the trigonometric polynomial in the exponent. The MESA estimate is well known to have difficulty estimating zeros. An additional reason to consider (18) seriously is its success in other fields, which we have already mentioned. Furthermore, spectral estimates based on the minimization of (19) have been reported recently in [34] and a first-order approximation of the estimate (18) appears to be equivalent to the PDFT estimator introduced in [35], [36].
These arguments do not clearly favor one estimator or the other. While the success of (13) in speech processing is strong evidence, it seems clear that the potential of (1 8) will continue to be raised, so that an empirical evaluation is necessary. This we attempt to do in the remainder of this paper.

EXPERIMENTAL APPROACH
This section contains basic definitions, a discussion of our experimental approach, and a discussion of various computational issues. Our general approach is to process various speech This somewhat contrived. acronym has the additional virtue of being the Latin word for "table," which is the source of the Spanish word for table (mesa). signals in order to compare measured power spectra and autocorrelations with MESA and MENSA estimates. We also synthesize speech using both MESA and MENSA power spectrum estimates and perform qualitative comparisons of the results.

A . Definitions and Notation
Let y = Cy1, y z , . . . , yt } comprise L time-domain samples, equispaced at intervals of At, from one "frame" of speech data.
From y , we compute estimated autocorrelations R E This is a biased estimate but it guarantees positive-definite-  When we refer to more than one speech frame, we add a subscript to the foregoing definitions.

B. What and How to Compare
In order to compare MESA and MENSA, we did three things. 1) For a variety of representative speech frames, we plotted A , A * , and R and compared them. 2) For the same frames, we plotted S, S*, and Q and compared them.
3) We compared speech synthesized two different ways: we used identical pitch and voicing decisions, and either S or S* for spectral shape. All of these comparisons were qualitative.
What about quantitative comparisons? For some distortion measure d, one could compare d(Q,

S) with d(Q, S*)
, but what should d be? One distortion measure could yield d(Q, S) < d(Q, S*) while another could yield the reverse inequality.
One reasonable choice is the Itakura-Saito distortion dIs [37] , which is known to be useful in speech processing. But, in the notation of Section 11, the Itakura-Saito distortion dIs (S, P) is just the asymptotic cross-entropy N ( q , p)-derivations of MESA spectra by cross-entropy minimization are equivalent to derivations by minimization of dIs [lo] , [31] , [38] . Not only does S minimize dIs (S, P) subject to the constraints, but S is the spectrum of the form (13) that minimizes dIs(Q, S ) [37], [39]. Use of dIs might therefore involve an intrinsic bias in favor of MESA. We therefore consider a distortion measure that bears a relation to MENSA analogous to that of d,, to MESA. Define the "cross-entropy distortion" dcE (Q, S ) to be the cross entropy of the probability distributions obtained by normalizing Q and S j = 1 j = 1 Then S* minimizes dCE(S*, P) subject to constraints just as S minimizes dIs(S, P) subject to constraints. Moreover, S* is one of the spectra of the form (2 1) that minimizes k E (Q, S*) [22] . We also use a third distortionmeasure, the gain-optimized Itakura-Saito distortion defined by dGo (Q,S) = min, d I s (gQ, S ) , where g ranges over positive constant scale factors [39] . This is closely related to dIs but, like dcE, is insensitive to changes in the gains of the two spectra. It can be computed from the formula

C. Numerical Issues and Procedures
The MENSA estimate S* can be produced by an algorithm that determines minimum cross-entropy probability distributions given arbitrary priors and arbitrary constraints [22], [40]. For the work reported here, we used a Fortran version of the Newton-Raphson based APL program described in [40] .
The resulting spectrum S* may be thought of as a discretefrequency approximation to a continuous power spectrum. Clearly, the accuracy of the discrete-frequency approximation will depend on the number of frequency points N .
As for S, a variety of methods are available. Standard MESA or LPC methods can produce the a, used in (8) or any of the equivalent sets of parameters such as reflection coefficients. The result is a continuous representation of the spectrum estimate thai can then be evaluated at the frequenciesfk in order to yield S. This is more accurate than methods that compute discrete-frequency approximatidns, but to use it might introduce a misleading source of differences between S and S*. We therefore chose to compute the S in a manner analogous to the computation of S*. In particular, we used a Fortran implementation of the MCESA [3 11 algorithm described in [41] , which uses the Newton-Raphson method to compute (13) for arbitrary priors and autocorrelation constraints. For a flat prior, the result is just a discrete-frequency approximation to a continuous MESA or LPC spectrum. As checks on the discrete-frequency computations of S* and S, we obtained results for various values of the number of frequencies N , and we compared the results for S with continuous frequency results obtained using Levinson recursion.
To obtain synthetic speech using the two different spectral shapes, we used commonly-available, LPC-based programs. Our procedure was as follows. First we analyzed the test sentence for pitch and voicing using a modified cepstral technique described in [42] and implemented in Version 4.0 of the Interactive Laboratory System (ILS) from Signal Technology, Inc. The results were used for both syntheses. For the synthesis based on.S*, we used a 29th-order all-pole approximation to the power spectrum S* in each frame. This approximation was computed by taking the first 29 lags of the autocorrelation extrapolation A* and using Levinson recursion to yield a set of reflection coefficients. As checks, we plotted the resulting approximate power spectrum and compared it with S*. For the synthesis based on S, we followed the analogous procedure-we ran Levinson recursion on the first 29 lags of A . Had we been dealing with exact, continuous spectra, the resulting "approximate" spectrum would be exactly equal to S, so it would have been reasonable to bypass this step. We included it, however, in order to keep the comparison as fair as possible. As a check, we also synthesized speech using spectra obtained directly from Levinson recursion on the first M + 1 lags of the measured autocorrelations R . Note that the 29thorder all-pole synthesis spectra are 29th-order approximations to S and S*, and not 29th-order approximations to Q.

IV. EXPERIMENTAL RESULTS
We obtained results for the sentence "The meeting begins at four P.M ." The sentence was spoken by a male, passed through an antialiasing filter, digitized at 8000 samples/s, and divided into 100 frames of 180 samples each. Using 256 discrete frequencies ( N = 256), we computed Ri, Qi, y, A;, 9, and Ai, did computations for some cases with N = 64 and N = 128. In general, there were no essential differences between results for .. N = 64 or 128 and N = 256. We also repeated the computations using Hamming windowing alone, 90 percent preemphasis alone, and both together on the digitized speech. In the following, we focus attention on two frames-frame 56, which contains a portion of the phoneme /f/, and frame 39, which contains a portion of the phoneme /I/. We refer to these frames j = 1 , ... , 100, as discussed in the previous section. We also by means of the subscripts f and I , respectively. Unless windowing or preemphasis is explicitly mentioned, the reference is to the spectra computed without preprocessing.

A . Comparison of Autocowelation Extrapolations
In Fig. 1, we plot Rf, A?, and Af for N = 256. When we plotted the continuous autocorrelation function obtained by Levinson recursion, it was indistinguishable from Af, which implies that the discrete frequency approximations are accurate. Beyond the constraint limit of lag 10, the extrapolations A; and Af differ from each other as well as from R . One would be hard pressed to argue that either one is a "better" extrapolation. The same conclusion follows from Fig. 2, in which we plot RI, A;, and AI.

B. Comparison of Power Spectra
Turning to the power spectra, we plot s"f, Sf, Sf, and  $ and SI are quite different. In particular, $ has deep nulls that are characteristic of the MENSA estimates for the whole test sentence. Indeed, the frequent occurrence of five lobes is obvious in Fig. 7, which shows the superimposed results of S* for all 100 frames ( N = 256), confirming an earlier conjecture about spectral zeros. No such structure occurs for S (Fig. 8).
The lobe structure appears to be related to the number of constraints: there are five lobes in Fig. 7, which is one half the analysis order ( M = 10). We repeated the computation of A* using M = 25 and M = 8. The resulting plots were similar to Fig. 7 except that about 12 and 4 lobes were apparent, respectively. Neither preemphasis nor windowing was entirely effective in eliminating the deep minima from the MENSA spectra.
The superposed plots continued to show a lobed structure, although more complex and less regular than the consistent fivelobe pattern of Fig. 7. The results of using both Hamming windowing and 90 percent preemphasis are shown in Fig. 9. In Fig. 10, we compare the ''actuaY power spectrum Qf with S; and S;. Both estimates appear to be smoothed versions of Q,-. Fig. 11 shows the analogous comparison for /I/. Here   Table  I. In one case the mean distortion for MENSA is slightly less than that for MESA, the difference being in the third decimal place. In every other case the mean distortion for MESA is less. This is true even for the "cross-entropy" distortions d c~, which might have been expected to favor MENSA. The ~C E results do not favor MESA as overwhelmingly as those from the other two distortion measures-especially dIs. The enormous values of d I s for the MENSA spectra are the result of the deep minima. The other two distortion measures contain the term QJS: only logarithmically. Thus, dIs penalizes underestimates more severely than do d~o and d C E . .Two columns of the table are identical: it appears that drs (Q, 5') = d~~ (Q, S). This is no coincidence, but a property of dIs and dGo. The equality can be shown to-hold provided t h a t ' s is a MESA spectrum and that Q is a spectrum that satisfies the same autocorrelation constraints that determine S. A proof can be based on the "correlation matching" property [22], [39] of MESA spectra.

C Comparison of Synthetic Speech
Although results such as Fig. 7, Fig. 11, and Table I suggest that S is better than S*, they are hardly compelling. This is a case where the proof must be in the hearing. Consequently, we synthesized the entire test sentence using the 29th-order LPC approximations as discussed in Section 111-C. The 29thorder LPC approximations to s"f, Sf, JT , and SI are also plotted in Figs. 3-6. The two curves are indistinguishable in Figs. 3 , 4 , and 6; the only discrepancy is for S;; (Fig. 5 ) . In that case, the 29th-order approximation is unable to match the deep nulls and also exhibits some peak splitting. The standard LPC speech and the speech based on S sounded identical, adding further confidence to the discrete frequency approximations. The versions based on S and S* sounded different, but-somewhat to our surprise-we and others judged them to be equally intelligible. There was, however, a distinct qualitative difference when preemphasis was not used. The speech based on S* was qualitatively inferior-it had a distinct ringing quality, as though spoken from the other end of a long, wide pipe. When preemphasis was used, alone or with Hamming windowing, the ringing quality was greatly reduced or effectively eliminated. Hamming windowing alone reduced the ringing only slightly. We hypothesize that this ringing effect is a reflection of the characteristic lobe structure and deep minima of the spectral estimates S*, since the ringing is most prominent when the lobing is most prominent and regular. However,. the ringing can be almost imperceptible while lobing is still plainly visible in spectral plots.

V. CONCLUSIONS
Based primarily on the results of speech synthesis, but also on results like Fig. 7, Fig. 11, and Table I, we believe that it is fair to conclude that MESA ( S ) yields better power spectrum estimates for speech processing than does MENSA (S*). In 1968 he joined the Naval Research Laboratory, Washington, DC, where he is currently Head of the Computer Science Section in the Computer Science and Systems Branch of the Information Technology Division. His previous research interests include computer architecture, dynamic memory allocation, programming language design, software engineering, and text-to-speech translation. His current interests include information theory, queuing theory, system modeling,pattern recognition, spectrum analysis, and speech processing.
Design of Antialiasing Patterns for Time-Sequential Sampling of Spatiotemporal Signals JAN P. ALLEBACH Abstract-The aliasing that results from time-sequential sampling of spatiotemporal signals is strongly dependent on the order in which the spatial points are sampled. To design sampling patterns that reduce aliasing, the sequence of sampling points is mapped into several shorter subsequences via the Chinese remainder theorem. A painvise exchange algorithm then finds the best ordering of each subsequence. The patterns obtained with this procedure perform substantially better than those known previously, and perform as well as the optimal patterns that can be expressed in closed form when the signal is termporally undersampled by less than a factor of 2. M I. INTRODUCTION ANY signal processing and communications problems involve time-varying images. To be processed digitally, these signals must be sampled in space and time. It is common practice to do the sampling in a time-sequential fashion, collecting a frame of samples one-by-one from the spatial region and then repeating this process. The scanning action may be generated electromechanically or by multiplexing the outputs from an array of sensors. Manuscript received February 14, 1983;revised June 22,1983 With some systems such as sensor arrays, the spatial points may be sampled in any order. With other systems, the ordering may be partially constrained by the scanning mechanism. In either case, the points are most frequently taken in lexicographic order which in 2 spatial dimensions results in line-byline scanning. Anumber of researchers have experimented with other orderings of the spatial points [ l ] - [5]. In particular, Deutsch [ l ] , [ 3 ] proposed an ordering which tends to distribute the samples taken during any time interval of duration less than the frame period uniformly over the spatial region. Since the ordering may be generated by a mapping from the bit reversed output of a binary counter, we refer to it as the bit reversed sampling pattern. Fig. 1 shows the lexicographic and bit reversed sampling patterns for one spatial dimension. During each frame period of duration B , M samples are taken uniformly over the spatial region at interval X. With either pattern, we would expect to resolve signal components with temporal frequency fo < I / (28) and spatial frequency uo < 1/(2X). With the bit reversed pattern samples taken during a time interval B/X, 1 < X <M are distributed at approximately a spatial interval of AX. With this pattern, we might expect to also resolve signal components with fo < A/(2B) and uo < 1/(2hX). As h increases, we trade spatial resolution for temporal resolution.