A Multi-Resolution Hidden Markov Model Using Class-Specific Features

We apply the PDF projection theorem to generalize the hidden Markov model (HMM) to accommodate multiple simultaneous segmentations of the raw data and multiple feature extraction transformations. Different segment sizes and feature transformations are assigned to each state. The algorithm averages over all allowable segmentations by mapping the segmentations to a “proxy” HMM and using the forward procedure. A by-product of the algorithm is the set of a posteriori state probability estimates that serve as a description of the input data. These probabilities have simultaneously the temporal resolution of the smallest processing windows and the processing gain and frequency resolution of the largest processing windows. The method is demonstrated on the problem of precisely modeling the consonant “T” in order to detect the presence of a distinct “burst” component. We compare the algorithm against standard speech analysis methods using data from the TIMIT corpus.


INTRODUCTION
The Hidden Markov Model (HMM) [1] combined with spectral analysis using cepstral coefficients [2] on fixed-length analysis windows remains at the forefront of automatic speech recognition (ASR) technology. One problem with this architecture is the necessity of using a fixed analysis window size. This constraint is a problem because in speech and other natural processes, the various phenomena being tested (such as phonemes in speech) may occur with differing time scales. The window size used in speech analysis is a compromise between phonemes with long time scales, such as vowels, and phonemes with shorter time scales, such as plosives. The need for a fixed-size window arises from the fundamental probabilistic approach that underlies the method and depends on the comparison of likelihood functions formed on a common feature space. One could not directly compare two likelihood functions if they are defined on different feature spaces. Even if pains are taken to normalize the behavior of similar features obtained from differing-size data windows, the fundamental basis for comparison is suspect.
With the introduction of the class-specific feature theorem [3], [4], [5], and later the probability density function (PDF) projection theorem (PPT) [6], the freedom now exists to use a different feature set for each class, even for each state in a HMM [7], and, as we now show, different analysis window lengths for each state. Thus, the topic of this paper is to apply the PPT to the problem of using varying-size analysis windows within the framework of a HMM.

THE HMM AND MULTI-RESOLUTION HMM (MRHMM) ON RAW DATA
We assume familiarity with hidden Markov models (HMMs).
A good reference is an article by Rabiner [1], from which we borrow notation. If we ignore the effects of overlapped processing, the underlying assumption when a time-series is segmented for processing is that the data in two different segments are conditionally statistically independent (CSI). In other words, the data in two segments are statistically independent conditioned on the system states in the two segments being known. The CSI property enables the efficient calculation of the joint PDF using the forward procedure. Let there be a raw data time-series, denoted by X, consisting of an integer multiple of T samples, where T is the basic time quantization. The traditional approach, which we describe simply as the HMM, is to divide the data into uniform T-sample segments which are to be processed separately. Let x_t represent the data in time-step t, consisting of data samples 1 + (t-1)T through tT. In the HMM, it is assumed that: 1. during any T-sample segment, the data is governed by one of M possible states; 2. any two samples, no matter how close together, that are contained in two different segments are CSI.
For the MRHMM, however, we assume that: 1. during any T-sample segment, the data is governed by one of M possible states; 2. for each state s, there is an associated minimum time duration: once the system transitions to state s, it must remain in that state for nK_sT samples, where K_s is the integer minimum duration parameter for state s, and n ≥ 1; 3. two data samples x_i and x_j are assumed to be CSI and are processed separately if (a) the system has made at least one state transition between times i and j, or (b) the system has been continually in the same state s but samples i and j are in different length-K_sT segments; otherwise, samples x_i and x_j are processed jointly; 4. to allow for the system being in a state for a number of length-T segments not divisible by K_s, we define a number of slave states, say states s′, s′′, that are slaved to state s in a way to be described below. At this point, the MRHMM can be seen as nothing more than a HMM with a specially structured π and A. But the more important difference, which we will explain below, is in the way that p(X|Q) is calculated. For the moment, let us talk about our goal. We seek an algorithm to solve the following four problems: 1. Segmentation. Find the most likely trajectory through the trellis subject to the restrictions described above. 2. State probabilities. Determine the a posteriori state probabilities γ_{t,m} = p(s_t = m|X). This is a more complete description of the trajectories than knowing the single most likely trajectory.

3. Joint PDF. The joint likelihood function of all the data
given the model is given by

L(X) = Σ_Q p(X|Q) P(Q),   (1)

where the sum is over the set of all possible trajectories Q and P(Q) is the a priori probability of a given trajectory through the trellis. Note that L(X) averages p(X|Q) over all trajectories through the trellis, weighted by the probability of the trajectory. Invalid trajectories have zero contribution. 4. Re-estimation. We would like to estimate the model parameters from the data. Parameters include π, A, and the parameters θ_s of the conditional state PDFs p(x_t|s, θ_s). For the HMM, the above problems are solved by the forward procedure, the associated backward procedure, and the Baum-Welch algorithm [1]. For the MRHMM, we need to adapt these algorithms, not only by structuring the π and A, but by changing the way that p(X|Q) is calculated. We will explain by example. Let the first state partition be length 3 (K_1 = 3) and let the partition for state s = 1 consist of the wait states q = 1, q = 2, and q = 3. Let Q = [4, 6, 7, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 5, 6, 7, 1, 2, 3, 10] be a particular valid length-20 state trajectory. Being a valid trajectory, wait states q = 1 through q = 3 occur only as part of the sequence 1, 2, 3. Here is the point at which the HMM and MRHMM differ. For the HMM, we have

p(X|Q) = Π_{t=1}^{N} p(x_t|q_t).   (2)

We can gather all the state PDF values into the matrix P_{t,q} = p(x_t|q). Then, (2) may be computed by the well-known forward procedure [1] operating on P_{t,q} and using parameters π and A. To change the HMM into a MRHMM, we need two steps: Step 1. Partial PDF values. For each valid trajectory Q and each state s, collect all terms in p(X|Q) associated with the wait state sequence for state partition s and replace the terms by the partial PDF value. Define

p(x^{K_s}_t|s) = Π_{i=1}^{K_s} p(x_{t+i-1}|q_i),

where q_1 … q_{K_s} is the sequence of wait states in the state s partition, and define the partial PDF value as [p(x^{K_s}_t|s)]^{1/K_s}, the geometric mean of the PDF terms in the sequence. In the above example, the first occurrence of the wait state sequence 1, 2, 3 is the sequence of terms p(x_4|q_4 = 1) p(x_5|q_5 = 2) p(x_6|q_6 = 3), which we denote by p(x^3_4|s = 1). We replace each of the three PDF factors by the partial PDF value [p(x^3_4|s = 1)]^{1/3}. This substitution does not change the value of p(X|Q).
Note that we can accomplish this by changing P_{t,q}. Associated with every possible occurrence of a partition s sequence is a diagonal line in matrix P_{t,q} of length K_s. The diagonal starts with the first wait state of partition s at any time t and ends with the last wait state of partition s at time t + K_s − 1. Each such sequence is replaced with the geometric mean as described. The resulting matrix is called the partial PDF matrix P^p_{t,q}. Note that applying the forward procedure to P^p_{t,q} gives precisely the same result as P_{t,q}, provided π and A reflect the restrictions to state transitions that were described above. Matrix P^p_{t,q}, if viewed as an image, appears to have diagonal "streaks" of constant value.
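Step 1 can be sketched as follows; this is an illustrative implementation (the function name and the `partitions` argument are our own), which replaces each complete length-K_s diagonal streak of the matrix with the geometric mean of its entries. Incomplete streaks at the boundaries are ignored in this sketch.

```python
import numpy as np

def partial_pdf_matrix(P, partitions):
    """Replace each length-K diagonal streak of P[t, q] with the geometric
    mean of its entries (the 'partial PDF value' of Step 1).

    P          : (T, Mq) matrix of per-segment PDF values over wait states
    partitions : list of (q0, K) pairs, q0 = column index of the partition's
                 first wait state, K = number of wait states in the partition
    """
    T = P.shape[0]
    Pp = P.copy()
    for q0, K in partitions:
        for t in range(T - K + 1):            # only complete streaks
            streak = np.array([P[t + i, q0 + i] for i in range(K)])
            gm = np.prod(streak) ** (1.0 / K)  # geometric mean
            for i in range(K):
                Pp[t + i, q0 + i] = gm
    return Pp
```

The key invariant, as the text notes, is that the product of the entries along any complete streak is unchanged, so the value of p(X|Q) for a valid trajectory is preserved.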
Step 2. Relax CSI assumption. Associated with each streak is a data analysis window x^{K_s}_t, which is a segment of data of length K_sT samples ending at sample (t + K_s − 1)T. The product of each streak of partial PDF values is a PDF of the data analysis window assuming CSI segments. To relax the CSI assumption within the streak, we replace the partial PDF values with the K_s-th root of the PDF of the analysis window processed as a unit. Let P^f_{t,q} represent the "full window" partial PDF values created this way. The value of L(X) calculated by the forward procedure operating on P^f_{t,q} changes; however, it remains a valid joint PDF of X. We know this because all we have done is replace the conditional PDFs p(X|Q), which assume all the segments are independent, with another PDF that assumes statistical dependence within the wait state sequences associated with a given state.
At this point, we have a raw-data based MRHMM model that we can compute efficiently using the forward procedure operating on P^f_{t,q}. To create a feature-based MRHMM model, we need only to apply the PPT.

CLASS-SPECIFIC MULTI-RESOLUTION HMM (CS-MRHMM)
The standard feature-based HMM is the same as the raw-data based HMM with the raw data segments x_t replaced by the feature vector z_t = T(x_t). With this simple replacement, the forward procedure computes the feature-based likelihood function L(Z). For the CS-MRHMM, we need to use the PPT to transition to the feature domain. Let x^K_t = [x_t, x_{t+1}, …, x_{t+K−1}] be the length-KT-sample analysis window which starts at sample 1 + (t − 1)T. It includes segments x_t through x_{t+K−1}.
The term p(x^K_t|s) will be calculated using the PDF projection theorem [6]. As we have written several publications on the topic, including the tutorial article [8], we describe the method only briefly. Let x be a general segment of raw time-series data. Let z_s = T_s(x) be a feature set calculated from x specifically designed for state s. Let p(z_s|s) be a PDF estimate of the feature set z_s based on training data from state s. The feature likelihood function is projected from the feature space to the raw data by pre-multiplying by the J-function as follows:

p_p(x|s) = J(x; T_s, H_{0,s}) p(z_s|s).

The function p_p(x|s) can be regarded as a function only of x by substituting T_s(x) for z_s, and can be shown to integrate to 1 over x (thus it is a PDF). The J-function is a unique function of x determined precisely from the feature transformation T_s and the class-dependent reference hypothesis H_{0,s}:

J(x; T_s, H_{0,s}) = p(x|H_{0,s}) / p(z_s|H_{0,s}).

Since J(x; T_s, H_{0,s}) is determined a priori without regard to training data, it can be considered the untrained part of p_p(x|s), while p(z_s|s) is the trained part. While it is true that p_p(x|s) is a PDF, it is only an estimate of p(x|s). The degree to which p_p(x|s) is a good estimate of p(x|s) depends on (a) the accuracy of p(z_s|s) and (b) the degree to which z_s is a sufficient statistic for the binary hypothesis test between s and H_{0,s}. In the rare case that z_s is in fact a sufficient statistic, the accuracy of p_p(x|s) depends only upon the accuracy of the low-dimensional PDF estimate p(z_s|s). The J-function takes many forms [6], one of which can be used when z_s are maximum likelihood (ML) estimates of a set of parameters. In this case, J(x; T_s, H_{0,s}) has a simple form based on the Fisher information matrix [6].
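The projection can be illustrated with a deliberately simple feature, not the AR features used in this paper: take z = Σ x_i² with reference hypothesis H_0 being iid N(0,1) samples, under which z has a chi-square distribution with N degrees of freedom. The function name and interface below are our own.

```python
import math
import numpy as np

def projected_log_pdf(x, log_pz_s):
    """PDF projection sketch for the feature z = sum(x**2).

    Reference H0: x ~ iid N(0,1), so z ~ chi-square(N).
    J(x; T, H0) = p(x|H0) / p(z|H0), and p_p(x|s) = J(x) * p(z|s).
    log_pz_s : callable returning log p(z|s) for the trained feature PDF.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    z = float(np.sum(x ** 2))
    # log p(x|H0): product of N standard normal densities
    log_px_H0 = -0.5 * N * math.log(2 * math.pi) - 0.5 * z
    # log p(z|H0): chi-square density with N degrees of freedom
    log_pz_H0 = ((N / 2 - 1) * math.log(z) - z / 2
                 - (N / 2) * math.log(2) - math.lgamma(N / 2))
    log_J = log_px_H0 - log_pz_H0
    return log_J + log_pz_s(z)
```

A useful sanity check: when the trained feature PDF equals the H_0 feature PDF, the projected PDF collapses exactly to p(x|H_0), since the feature density cancels against the J-function denominator.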

PRACTICAL IMPLEMENTATION DETAILS
Let us briefly review what we have done so far. We have described how to compute the likelihood function of the CS-MRHMM. To do this, we identify every time-shifted analysis window. For each state s and time step t, we identify the analysis window that starts at time step t and is of length K_sT samples. On this window, we extract the state-dependent feature set, then use the PPT to compute the raw-data PDF of the analysis window. We then take the K_s-th root of the PDF value and insert this value into the length-K_s diagonal streak in the matrix P^f_{t,q}. Then, we apply the well-known forward procedure using the expanded parameters π_e, A_e. When K_s is large, this requires a highly overlapped set of windows.
The amount of processing required can be mitigated by recursive processing. For example, the FFT or autocorrelation function (ACF) of a segment can be updated to reflect data that has been shifted out and data that has been shifted in [9]. Applying the MRHMM to real data warrants additional details beyond what has been described so far.
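A recursive ACF update of the kind mentioned above can be sketched as follows (illustrative only; the function names are ours). When the window slides one sample to the right, each lag needs only one term subtracted and one added, instead of a full recomputation.

```python
import numpy as np

def acf(window, L):
    """Direct (unnormalized) autocorrelation of a window up to lag L-1."""
    N = len(window)
    return np.array([np.dot(window[:N - k], window[k:]) for k in range(L)])

def acf_shift_one(r, x, start, N, L):
    """Update r = ACF of x[start:start+N] to the ACF of x[start+1:start+N+1].

    For each lag k: the product involving the departing sample x[start]
    leaves on the left, and a product involving x[start+N] enters on the right.
    """
    r_new = r.copy()
    for k in range(L):
        r_new[k] -= x[start] * x[start + k]           # term shifted out
        r_new[k] += x[start + N - k] * x[start + N]   # term shifted in
    return r_new
```

Sliding the window by a full elemental segment of T samples would simply apply this update T times (or a blocked version of the same idea).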

States vs. Signal classes
Let signal class refer to a particular signal phenomenon observed in the data. Let signal state refer to an instance of a signal class. In the simplest situation, signal class and signal state are synonymous. But if a signal class is observed to repeat, additional signal states may be used to represent the additional occurrences. These in turn give rise to additional partitions.

Slave Partitions
We have already introduced the notion of slave partitions (slave states). Up to now, this has only meant the necessity of adding additional states with lower K. To compute L(X) with the forward procedure, there is nothing else that needs to be done. However, to train the parameters, we will need to discuss the process of ganging states.

Training the CS-MRHMM
In the standard Baum-Welch algorithm for re-estimation of HMM parameters [1], the state feature PDFs for state s are trained by maximizing log-likelihood functions weighted by γ_{s,t}. Since the standard HMM does not differentiate between wait states, we would need a separate PDF estimate for each wait state. However, for the CS-MRHMM, there are only PDF estimates associated with the initial wait states, the first wait states of each partition. Logically, the CS-MRHMM produces values of γ_{q,t} that are constant in diagonal streaks in a partition. That is, γ_{q,t} = γ_{q+1,t+1} if wait states q and q + 1 are in the same partition. Thus, in the CS-MRHMM, each analysis window can be traced to a given constant-valued streak in the γ_{q,t} matrix. When training the CS-MRHMM, the features from the associated analysis window are weighted by the corresponding value of γ_{q,t} in the streak. Training becomes slightly more complicated, however, once we consider slave partitions and if the number of signal states exceeds the number of signal classes. While each partition is associated with a PDF estimate, we may not want all partition PDF estimates to be independent. To remedy this situation, we "gang together" all partitions that associate with a given signal class. To gang partitions, we first create a compressed version of γ_{q,t}, denoted by γ^c_{m,t}, which sums γ_{q,t} over all wait states associated with signal class m. We then weight an analysis window by the smallest value of γ^c_{m,t} in the set of time steps t contained in the analysis window. This works very well in practice, but is a clear departure from the Baum-Welch algorithm and may produce an algorithm without guaranteed monotonicity.
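The ganging step can be sketched as below. This is an illustrative fragment with names of our own choosing; it compresses γ over the wait states of a signal class and then takes the minimum over the time steps spanned by the analysis window, as described in the text.

```python
import numpy as np

def window_weight(gamma, class_of_state, m, t0, K):
    """Training weight for the analysis window of signal class m
    starting at time step t0 with length K elemental segments.

    gamma          : (Mq, T) a posteriori wait-state probabilities gamma[q, t]
    class_of_state : (Mq,) signal class index of each wait state
    """
    # compress gamma over all wait states belonging to class m
    gamma_c = gamma[np.asarray(class_of_state) == m].sum(axis=0)
    # weight the window by the smallest compressed value inside it
    return gamma_c[t0:t0 + K].min()
```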

Efficient Implementation
The number of wait states in the expanded HMM problem can be very large. The forward and backward procedures have a complexity on the order of the square of the number of states. Thus, an efficient implementation of the forward and backward procedures and Baum-Welch algorithm may be needed that takes advantage of the redundancies in the expanded problem. We have obtained a time reduction factor of 42 with a problem that had 7 signal classes and expanded to 274 wait states. The two algorithms were tested to produce the same results within machine precision.

Simulated Data
To illustrate the concepts, we tested the CS-MRHMM using simulated data. To independent identically distributed (iid) Gaussian noise, we added a low-frequency (LF) pulse of an autoregressive (AR) process, 128 samples in length with a peak frequency response of 0.4 radians per sample, followed by a random-length gap of at least 256 samples, followed by a high-frequency (HF) pulse of an AR process, 64 samples in length with a peak frequency response of 1. There were three signal classes: "noise", "LF pulse", and "HF pulse". We used nine partitions, including six slave partitions. The elemental segment length was T = 32 samples. There were a total of 25 wait states. Parameters of the nine partitions are listed in Table 1. Autoregressive (LPC) features of model order P (see Table 1) were extracted by overlapped window processing. A separate feature processor was used for each combination of K and P. Features were shared between partitions that had the same K and P values. Analysis windows were always shifted by the elemental segment length of 32 samples for each update, so the amount of overlap depended on the length of the analysis window. To handle end effects, data was assumed to wrap around in time.
Features were extracted from each analysis window by first taking the FFT, computing the magnitude squared, then computing the inverse FFT to produce the autocorrelation function (ACF). The Levinson algorithm was used to produce the reflection coefficients of order P. The total power in the window is also stored as the (P + 1)-st feature. The J-function [6] is obtained by use of the saddle-point approximation [10]. Further implementation details of the AR models can be found in [8].
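The ACF-to-reflection-coefficient step can be sketched with a standard Levinson-Durbin recursion. This is a generic textbook implementation, not the authors' code, and sign conventions for reflection coefficients vary between references; here k_m = -acc/E.

```python
import numpy as np

def levinson_reflection(r, P):
    """Levinson-Durbin recursion: reflection coefficients k_1..k_P from an ACF.

    r : autocorrelation values r[0..P] (r[0] = total power)
    Returns (k, E): reflection coefficients and final prediction error power.
    """
    a = np.zeros(P + 1)          # prediction filter coefficients, a[0] implicit 1
    k = np.zeros(P)
    E = r[0]                     # zeroth-order prediction error = power
    for m in range(1, P + 1):
        # acc = r[m] + sum_{i=1}^{m-1} a[i] * r[m-i]
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        km = -acc / E
        k[m - 1] = km
        a_new = a.copy()
        a_new[m] = km
        a_new[1:m] = a[1:m] + km * a[m - 1:0:-1]   # order update of the filter
        a = a_new
        E *= (1.0 - km ** 2)                        # prediction error shrinks
    return k, E
```

For an AR(1)-shaped ACF (r[k] proportional to a^k), only the first reflection coefficient is nonzero, which is a convenient sanity check.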
In Figure 2, we see the partial PDF matrix P^f_{t,q} for a typical sample. Wait states q = 1 through q = 15 are associated with the "Noise" signal class, wait states q = 16 through q = 22 are associated with the "LF Pulse" signal class, and wait states q = 23 through q = 25 are associated with the "HF Pulse" signal class. The gamma probabilities are a by-product of the algorithm and are shown in Figure 3. Note that the wait states for "LF pulse" (q = 16 through q = 19) are clearly seen where the pulse occurs. The same is true of the "HF pulse" event (q = 23 through q = 24). It is possible to see various competing trajectories through the trellis. Note, for example, that in time steps 43 through 56, the noise gap between the two pulses, the HMM is in the noise signal class. In steps 43-50, it is in partition 1 (wait states q = 1 through q = 8). Then, after exiting wait state q = 4, it has located two possibilities to span the six time steps remaining before the HF pulse occurs. It can either go into partition 2 (wait states q = 9 through q = 12) then partition 3 (wait states q = 13 through q = 14), or it can choose the reverse, partition 3 then partition 2.

Figure 4: Example of CS-MRHMM operating on the word "stool". From top to bottom: compressed gamma probabilities γ^c_{m,t}, log signal power, and spectrogram. Short analysis windows have been employed for the "T", while longer processing has been used for background noise and the sounds "S", "oo", and "L". The three components of the "T" can be clearly identified.
The gamma probabilities can be collapsed to indicate just the signal classes, as shown in Figure 5. The class probabilities (Figure 5) are an accurate indication of the true content of the data to a time resolution of T = 32 samples.

Speech Data
We used the CS-MRHMM to analyze the spoken word "stool" at 16 kHz sample rate. Space restrictions do not permit a detailed description of the experiment. We identified seven signal classes and assigned values of K and P (LPC order) as follows: (1) noise, used for both background and the "T" closure: K = 12 (384 samples), P = 7; (2) "S": K = 12 (384 samples), P = 7; (3) "T" burst: K = 4 (128 samples), P = 5; (4) "T" aspiration: K = 8 (256 samples), P = 6; (5) "oo" vowel part 1: K = 24 (768 samples), P = 8; (6) "oo" vowel part 2: K = 24 (768 samples), P = 8; (7) "L": K = 24 (768 samples), P = 8. After adding slave partitions, we had a total of 36 partitions and a total of 258 wait states. The expanded STM was 258 by 258. Using efficient programming, neither the partial probability matrix nor the expanded STM actually need to be created. Figure 4 shows the result of analysis of one example with the CS-MRHMM. It is important to note that the three components of the "T" can be clearly seen by observing γ^c_{m,t}.

Figure 1 :
Figure 1: Example of spectrogram of synthetic data. The data consists of three signal classes. Class 1 (noise) occurs first, then a low-frequency pulse of duration 128 samples, then noise, then a high-frequency pulse of duration 64 samples.

Figure 2 :
Figure 2: Partial PDF matrix P^f_{t,q}, showing divisions between signal classes (solid horizontal lines) and between wait state partitions (dotted lines). Higher probability is darker.

Figure 3 :
Figure 3: Wait state probabilities for illustrative example.

Figure 5 :
Figure 5: Signal class probabilities calculated by summing Figure 3 over the wait states of each class.
Let Q = [s_1, s_2, …, s_N] be a set of state values, where 1 ≤ s_t ≤ M, 1 ≤ t ≤ N. We call Q a trajectory because it defines one of the many paths through the state diagram or trellis. There are no restrictions on Q except for those imposed by the initial state probabilities π = {π_m, 1 ≤ m ≤ M} and the state transition matrix (STM) A = {A_{l,m}, 1 ≤ l ≤ M, 1 ≤ m ≤ M}. For the MRHMM, we can encode all the above restrictions imposed on state transitions by properly structuring π and A. For each state s, we can define a partition of states, which we call wait states, of size K_s. Let A_e be the expanded MRHMM STM and let π_e be the expanded set of prior state probabilities. We structure A_e so that state transitions into the state s partition are only allowed into the first wait state. From the first wait state, the state is forced to increment to the second, third, ... and finally to wait state K_s. From wait state K_s, the state is allowed to transition to the first wait state of any state partition. Note that although A_e is of dimension M_e × M_e, where M_e = Σ_{m=1}^{M} K_m, there are only M^2 free parameters in A_e.
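The structure of the expanded STM described above can be sketched as follows; this is an illustrative construction with names of our own choosing, building A_e from the M x M STM A and the duration parameters K_s.

```python
import numpy as np

def expand_stm(A, K):
    """Build the expanded MRHMM STM A_e from an M x M STM A and
    minimum-duration parameters K[s] (wait states per partition).

    Transitions enter a partition only at its first wait state; within a
    partition the state increments deterministically; only the last wait
    state may transition out, to the first wait state of any partition.
    """
    M = len(K)
    Me = sum(K)
    first = np.cumsum([0] + list(K[:-1]))        # first wait state of each partition
    Ae = np.zeros((Me, Me))
    for s in range(M):
        f, Ks = first[s], K[s]
        for i in range(Ks - 1):
            Ae[f + i, f + i + 1] = 1.0           # forced increment through the partition
        for m in range(M):
            Ae[f + Ks - 1, first[m]] = A[s, m]   # exit only from the last wait state
    return Ae
```

Only the M^2 entries copied from A are free parameters; every other nonzero entry of A_e is a deterministic increment.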

Table 1 :
Partition parameters for the illustrative example. K is the partition length in elemental segments. KT is the length of the partition in samples. Parameter P is the autoregressive (AR) model order (same as LPC model order).