Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Recognition


the development of MFCC that emphasizes the human aspect of psychoacoustics. The three types of feature extraction are tested with Hidden Markov Models on a Sundanese speech corpus. Similar comparisons have been made in other studies, but not on a Sundanese speech corpus. Speech recognition is language-dependent, so a system must be rebuilt for every language it has not yet covered. This is the primary motivation for this research.

Literature Study
The dataset was tested using three types of feature extraction and a Hidden Markov Model. The feature extraction methods are Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC).

Linear Predictive Coding
LPC represents the human voice signal at time n, s(n), as a linear combination of the p previous samples of the signal [8]. It is shown in equation (1).

$$ s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p) \tag{1} $$

Here a_1, ..., a_p are the LPC coefficients.

The steps in LPC are:
a. Pre-emphasis. The sound signal, converted into a digital signal s(n), is passed through a low-order filter. The most commonly used pre-emphasis filter is a first-order system.
b. Frame Blocking. After pre-emphasis, the signal is blocked into frames of a specific window size. Adjacent frames overlap, so the LPC spectral estimates of neighboring frames are correlated.
c. Windowing. Each frame is windowed to minimize the discontinuity at its beginning and end. The most commonly used window for the LPC autocorrelation method is the Hamming window.
d. Autocorrelation Analysis. Each windowed frame, written here as \tilde{x}(n) with L samples, is autocorrelated using equation (2).

$$ r(m) = \sum_{n=0}^{L-1-m} \tilde{x}(n)\, \tilde{x}(n+m) \tag{2} $$

where m = 0, 1, 2, ..., p. Here p is the highest autocorrelation lag and also the LPC order. Typical values of the LPC analysis order are between 8 and 16. An advantage of the autocorrelation method is that the zeroth value, r(0), is the energy of the frame being autocorrelated.
e. LPC Analysis. All the autocorrelation values calculated in the previous stage are converted to an LPC parameter set. This set can take several forms: LPC coefficients, cepstral coefficients, or another desired transformation. A standard method for converting the autocorrelation coefficients into LPC coefficients is the Durbin method.
f. Converting LPC Parameters to Cepstral Coefficients. An important parameter set that can be derived from the LPC coefficients is the cepstral coefficients, c(m). These are the coefficients of the Fourier transform representation of the logarithmic spectrum.
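The LPC steps described above can be sketched in Python with NumPy. This is a minimal illustration, not the implementation used in the study: the pre-emphasis coefficient 0.95, the frame length of 400 samples, the hop of 160 samples, the order p = 12, and all function names are assumptions chosen for the example. The sign convention follows equation (1), where the predictor coefficients enter with a positive sign.

```python
import numpy as np

def pre_emphasis(s, a=0.95):
    """First-order pre-emphasis y(n) = s(n) - a * s(n-1); a = 0.95 is an assumed value."""
    s = np.asarray(s, dtype=float)
    return np.append(s[0], s[1:] - a * s[:-1])

def frame_signal(s, frame_len, hop):
    """Block the signal into frames; hop < frame_len makes adjacent frames overlap."""
    s = np.asarray(s, dtype=float)
    n_frames = 1 + (len(s) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return s[idx]

def autocorrelation(frame, p):
    """r(m) = sum_n x(n) x(n+m) for m = 0..p; r(0) is the frame energy."""
    n = len(frame)
    return np.array([np.dot(frame[: n - m], frame[m:]) for m in range(p + 1)])

def levinson_durbin(r):
    """Durbin's recursion: autocorrelation r(0..p) -> LPC coefficients a_1..a_p."""
    p = len(r) - 1
    a = np.zeros(p + 1)              # a[k] holds a_k; a[0] is unused
    err = r[0]                       # prediction error energy
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - k * k
    return a[1:], err

def lpc_to_cepstrum(a):
    """c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k} for m = 1..p."""
    p = len(a)
    c = np.zeros(p)
    for m in range(1, p + 1):
        acc = a[m - 1]
        for k in range(1, m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c

def lpc_cepstra(signal, frame_len=400, hop=160, p=12):
    """Run all steps and return one cepstral vector per frame."""
    frames = frame_signal(pre_emphasis(signal), frame_len, hop)
    window = np.hamming(frame_len)
    feats = []
    for frame in frames:
        r = autocorrelation(frame * window, p)
        if r[0] == 0.0:              # silent frame: skip the recursion
            feats.append(np.zeros(p))
            continue
        a, _ = levinson_durbin(r)
        feats.append(lpc_to_cepstrum(a))
    return np.array(feats)
```

With a 16 kHz recording, the assumed defaults correspond to 25 ms frames taken every 10 ms.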

Mel Frequency Cepstral Coefficients (MFCC)
MFCC can be used as a feature vector to represent human voice and musical signals. It follows the human auditory system, in which the voice signal is filtered linearly at low frequencies (below 1000 Hz) and logarithmically at high frequencies (above 1000 Hz). Mel-frequency analysis applies a set of filters at specific frequencies, as happens in the human hearing system. The filters are spaced non-uniformly on the frequency axis, which places many filters in the low-frequency region and few in the high-frequency region.

The MFCC calculation is based on short-term analysis, because the voice signal is quasi-stationary. Over a short enough period (about 10 to 30 milliseconds) the sound signal shows stationary characteristics. Over a longer period, its characteristics change according to the spoken word. The MFCC method has several stages:
a. Preprocessing. Preprocessing for MFCC includes framing and windowing. The human voice is a non-stationary signal, but it can be treated as stationary on a time scale of 10-30 ms. Framing cuts the long sound signal into shorter segments whose characteristics are more stable. Windowing aims to reduce spectral leakage or aliasing, in which components appear at frequencies that differ from those of the original signal. These effects can arise from a low sampling rate or from the framing process, which makes the signal discontinuous at the frame edges.
b. Discrete Fourier Transform (DFT). To obtain the frequency-domain representation of a discrete signal, one of the Fourier transformation methods used is the Discrete Fourier Transform (DFT) [11]. The DFT is performed every 10 ms on the signal.
c. Mel-Frequency Wrapping. The Mel-frequency scale is linear below 1 kHz and logarithmic above 1 kHz. The Mel scale can be obtained using equation (3).

$$ B(f) = 1125 \ln\left(1 + \frac{f}{700}\right) \tag{3} $$

where B is the Mel-frequency scale, and f is the linear frequency in Hz.
d. Cepstrum. The Mel-frequency cepstrum is obtained by applying the DCT (Discrete Cosine Transform), which returns the signal to the time domain. The result is called the Mel-Frequency Cepstral Coefficients (MFCC). MFCC can be obtained from equation (4):

$$ c(n) = \sum_{m=0}^{M-1} S(m) \cos\left(\frac{\pi n \left(m + \tfrac{1}{2}\right)}{M}\right) \tag{4} $$

where M is the number of Mel filters and S(m) is the logarithm of the accumulated squared DFT magnitude weighted by the m-th filter of the Mel filter bank. In speech recognition, usually only the first 13 cepstral coefficients are used.
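The MFCC stages can be sketched for a single frame as follows. This is an illustrative sketch under stated assumptions, not the configuration used in the study: the Mel mapping is taken as B(f) = 1125 ln(1 + f/700), the filters are triangular, and the 26 filters, 512-point FFT, and 16 kHz sample rate used in the usage note are assumed values. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale B(f) = 1125 ln(1 + f / 700), with f in Hz."""
    return 1125.0 * np.log(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(b):
    """Inverse of hz_to_mel."""
    return 700.0 * (np.exp(np.asarray(b, dtype=float) / 1125.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge, peak at center
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def mfcc_frame(frame, fbank, n_ceps=13):
    """Window -> DFT -> Mel filter bank -> log -> DCT for one frame."""
    n_fft = 2 * (fbank.shape[1] - 1)
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    log_energy = np.log(fbank @ power + 1e-10)   # small floor avoids log(0)
    n_filters = len(log_energy)
    n = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * n * (m + 0.5) / n_filters)
    return dct @ log_energy
```

For example, `mfcc_frame(frame, mel_filterbank(26, 512, 16000))` returns the first 13 coefficients for a 25 ms frame of 400 samples. Because the filter centers are equally spaced in Mel, most of the filters fall below 4 kHz, which matches the non-uniform spacing described in the text.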

Human Factor Cepstral Coefficients (HFCC)
HFCC is a development of MFCC [12]. Like MFCC, HFCC is intended to produce features for an artificial classifier. The method explicitly applies Moore and Glasberg's Equivalent Rectangular Bandwidth (ERB) as part of the filtering mechanism, where the ERB is given by equation (5).

$$ \mathrm{ERB} = 6.23 f_c^2 + 93.39 f_c + 28.52 \ \text{Hz} \tag{5} $$

where f_c is the center frequency in units of kHz. HFCC uses more than one factor, which makes it more robust to noise.
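As a small worked example, the ERB can be evaluated for a few center frequencies. The sketch assumes the commonly cited Moore and Glasberg polynomial, ERB = 6.23 f² + 93.39 f + 28.52 Hz with f in kHz. The function name and the chosen center frequencies are illustrative only.

```python
def erb_hz(fc_khz):
    """Equivalent Rectangular Bandwidth in Hz for a center frequency given in kHz."""
    return 6.23 * fc_khz ** 2 + 93.39 * fc_khz + 28.52

# Bandwidth grows with center frequency, so high-frequency filters are wider.
for fc in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(f"{fc:4.2f} kHz -> ERB {erb_hz(fc):7.2f} Hz")
```

At 1 kHz this gives 6.23 + 93.39 + 28.52 = 128.14 Hz.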

K-means Clustering
Clustering groups data with the same characteristics into the same region and separates it from data with different characteristics [13]. K-Means Clustering is a simple method based on the mean value of each cluster [14]. Each object is compared by distance to the cluster midpoints (centroids). Once the closest midpoint is known, the object is classified as a member of that cluster.
The algorithm is as follows:
a. Determine the number of clusters.
b. Assign data to clusters randomly.
c. Calculate the centroid (average) of the data in each cluster.
d. Assign each data point to the nearest centroid.
e. Return to step c if any data point moved to another cluster, or if the change in the objective function is above a specified threshold.
The distance between a data point and a centroid is commonly calculated as the Euclidean distance.
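The algorithm above can be sketched as follows. This is a minimal version under stated assumptions: it stops only when no point changes cluster (the objective-function threshold is left out), and an empty cluster is re-seeded with a randomly chosen data point, which is a choice made for this example and is not described in the text. The `quantize` helper is a hypothetical name for mapping vectors to their nearest centroid index.

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """K-means with random initial assignment and Euclidean distance."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    labels = rng.integers(0, k, size=len(data))          # step b
    centroids = np.empty((k, data.shape[1]))
    for _ in range(n_iter):
        for c in range(k):                               # step c
            members = data[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
            else:                                        # assumed handling of empty clusters
                centroids[c] = data[rng.integers(len(data))]
        new_labels = quantize(data, centroids)           # step d
        if np.array_equal(new_labels, labels):           # step e: no point moved
            break
        labels = new_labels
    return centroids, labels

def quantize(data, centroids):
    """Index of the nearest centroid for each row of data."""
    data = np.asarray(data, dtype=float)
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

The same `quantize` call can turn feature vectors into discrete symbols once the centroids are fixed, which is how a codebook is typically used with a discrete HMM.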

Hidden Markov Models
In a Hidden Markov Model (HMM), the states are hidden from the observer and only the output symbols are observed. Every state in an HMM has a probability distribution over the output symbols that might appear. The series of symbols generated by an HMM therefore provides information about the underlying sequence of states.
HMM has the following notation: 1. N = number of states in the model.

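How to obtain the HMM parameters, λ = (A, B, π), so that the value of P(O | λ) is maximal. The first problem can be handled using the forward algorithm, and the third problem can be solved using the Baum-Welch algorithm. The forward algorithm is an efficient recursive algorithm to calculate P(O | λ). The forward variable α_t(i) is the probability of the observations up to time t together with being in state q_i at time t. The algorithm is described in equation (6).

$$ \alpha_0(i) = \pi_i\, b_i(O_0), \qquad \alpha_t(i) = \left[ \sum_{j=0}^{N-1} \alpha_{t-1}(j)\, a_{ji} \right] b_i(O_t), \qquad P(O \mid \lambda) = \sum_{i=0}^{N-1} \alpha_{T-1}(i) \tag{6} $$

Here π_i is the initial probability of state q_i, a_{ji} is the probability of moving from state q_j to state q_i, and b_i(O_t) is the probability of observing O_t in state q_i.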
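A minimal sketch of the forward algorithm for a discrete HMM follows. It assumes zero-based time indices, a transition matrix `A` with rows as source states, an observation matrix `B` with one column per symbol, and an initial distribution `pi`. These names are chosen for the example. Scaling is omitted, so long sequences would underflow. The small two-state model in the usage note is invented purely for illustration.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return the forward variables alpha[t, i] and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # initialization
    for t in range(1, T):
        # sum over predecessor states, then weight by the emission probability
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()
```

For instance, with `A = [[0.7, 0.3], [0.4, 0.6]]`, `B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]`, `pi = [0.6, 0.4]` (as NumPy arrays) and the observation sequence `[0, 1, 0, 2]`, the returned probability is about 0.00963.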
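The Baum-Welch algorithm trains the initial HMM by re-estimating the model parameters λ = (A, B, π). For t = 0, 1, ..., T-2 and i, j ∈ {0, 1, ..., N-1}, γ_t(i, j) is defined as shown in equation (7).

$$ \gamma_t(i, j) = P(x_t = q_i,\ x_{t+1} = q_j \mid O, \lambda) \tag{7} $$

It is the probability of being in state q_i at time t and moving to state q_j at time t + 1. It can be written as equation (8).

$$ \gamma_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} \tag{8} $$

Here α_t(i) and β_t(i) are the forward and backward variables, a_{ij} is the probability of moving from state q_i to state q_j, and b_j(O_{t+1}) is the probability of observing O_{t+1} in state q_j. The relationship between γ_t(i) and γ_t(i, j) is shown in equation (9).

$$ \gamma_t(i) = \sum_{j=0}^{N-1} \gamma_t(i, j) \tag{9} $$

The estimation process is iterative. It is described as follows:
1. Initialization of λ = (A, B, π)
2. Calculation of α_t(i), β_t(i), γ_t(i, j), and γ_t(i)
3. Re-estimation of the model λ = (A, B, π)
If the value of P(O | λ) increases, the system repeats the process from step 2.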
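The iteration can be sketched for a single discrete observation sequence as follows. This is an unscaled textbook-style sketch, not the code used in the study. The names `digamma` and `gamma` stand for the two-state and one-state posteriors, the relative tolerance `tol` is an assumed stopping detail, and every parameter is assumed to be strictly positive so that no division by zero occurs.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Forward variables alpha, backward variables beta, and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()

def baum_welch_step(A, B, pi, obs):
    """One re-estimation step; also returns P(O | lambda) of the input model."""
    obs = np.asarray(obs)
    alpha, beta, prob = forward_backward(A, B, pi, obs)
    # digamma[t, i, j]: in state i at time t and in state j at time t + 1
    digamma = (alpha[:-1, :, None] * A[None, :, :]
               * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob
    gamma = alpha * beta / prob                      # gamma[t, i]
    new_pi = gamma[0]
    new_A = digamma.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B, dtype=float)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)    # times at which symbol k was seen
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, prob

def train(A, B, pi, obs, max_iter=100, tol=1e-6):
    """Repeat re-estimation while P(O | lambda) keeps increasing."""
    prev = 0.0
    for _ in range(max_iter):
        new_A, new_B, new_pi, prob = baum_welch_step(A, B, pi, obs)
        if prob <= prev * (1.0 + tol):               # no meaningful increase: stop
            break
        A, B, pi, prev = new_A, new_B, new_pi, prob
    return A, B, pi
```

In practice, implementations usually work with scaled or log-domain quantities and accumulate statistics over many training sequences. Both are left out here to keep the correspondence with the three steps visible.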

System Design
The dataset comes from four different speakers. The recording was done in a soundproof room to avoid noise during the process. Each speaker said the digits 0 to 9, 60 times each. The data was then divided into two parts: 33% of the dataset was used as test data and the rest as training data. The system consists of several stages, shown in Figure 1:
a. Pre-processing. The input to the system is a WAV file. This format is part of the Microsoft RIFF specification used for storing multimedia files. It starts with a header section followed by a sequence of data chunks, and it consists of three parts: the main chunk, the format chunk, and the data chunk. The sound signal is represented in discrete form, as a series of numbers representing amplitude in the time domain. The header holds information about the WAV file, including the sample rate, bits per sample, and number of channels. Pre-processing adjusts the input so that it can be processed in later stages. The two primary operations in pre-processing are centering and normalization.
b. Feature Extraction. This is the process of determining a value or vector that can be used to identify an object. The three methods used in this research are Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC).
c. Vector Quantization (VQ). Vector quantization encodes signal vectors into a set of symbols [16]. It consists of two processes. The first is a learning process that produces the codebook (the centroids, or cluster centers). The second is a testing process that transforms the feature extraction results into symbols based on the obtained codebook. In this study, K-Means Clustering performs this process.
d. HMM Re-estimation. At this stage, the training data is processed with the forward and backward algorithms to produce a model representing each of the ten digits.
e. Prediction. All models produced at the HMM re-estimation stage are evaluated using the test data. The model with the maximum likelihood gives the predicted label.
f. Evaluation. The performance of each feature extraction method on the Sundanese data is evaluated using the F-measure, a test parameter based on a combination of precision and recall [17].
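Three of the stages above are simple enough to sketch directly: pre-processing, prediction, and evaluation. The sketch makes assumptions that the text does not specify: normalization is taken to mean scaling by the peak absolute amplitude, and the F-measure is macro-averaged over the ten digit classes. The function names are illustrative.

```python
import numpy as np

def preprocess(samples):
    """Centering (remove the mean) followed by peak normalization to [-1, 1]."""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()
    peak = np.abs(x).max()
    return x / peak if peak > 0 else x

def predict(likelihoods):
    """likelihoods[d] is P(O | model of digit d); pick the most likely digit."""
    return int(np.argmax(likelihoods))

def f_measure(y_true, y_pred, n_classes=10):
    """Macro-averaged F-measure: harmonic mean of precision and recall per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return float(np.mean(scores))
```

As a check on the arithmetic, with true labels `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]` over two classes, class 0 has precision 1 and recall 0.5 (F = 2/3), class 1 has precision 2/3 and recall 1 (F = 0.8), and the macro average is 11/15.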

Results and Analysis
The three feature extraction methods are evaluated based on the effect of changing the number of clusters and the number of hidden states used.

Cluster Analysis
Using K-Means Clustering, the influence of the number of clusters on the three types of feature extraction is shown in Table 1. The analysis was done by varying the number of clusters while the number of hidden states was held constant at five. According to Table 1, 32 clusters performed better than 16 clusters. This indicates that the phonemes of the digits consist of a large number of different units. When the units were represented by only 16 clusters, several different units fell into the same cluster. However, 64 clusters gave the worst system performance. This means that bias occurred among similar units because they were placed in different clusters.
The case of 128 clusters was different: the specialization was so sharp that the units were defined very distinctly. Units that were still treated as one cluster with 32 clusters were separated into several different clusters with 128 clusters. Compared with 64 clusters, the sharper distinction meant that the bias did not occur with 128 clusters. This is why the performance with 128 clusters was the best. Comparing the three feature extraction methods used in this study shows that their performance was equal. The experiments show that, although the three methods produce different types of features, these features have the same discriminating power in classifying digits.

Analysis of Hidden States
The second experiment analyzes the effect of the number of hidden states in the HMM. It uses 128 clusters, which gave the best performance in the first experiment. Table 2 shows that the best performance is obtained with five hidden states and the worst with nine. Performance shows no trend as the number of hidden states increases. As more hidden states are used, the system further adjusts the correlation parameters between hidden states. Consequently, there is no significant difference between using many or few hidden states, and increasing the number of hidden states did not always improve the performance of the system. The three feature extraction methods again performed equally, because the values of their features are highly similar.

Conclusion
Based on Tables 1 and 2, it can be concluded that:
a. The three feature extraction methods used in this study have the same performance for Sundanese speech recognition.
b. Using 128 clusters gives the best performance, because the distinctive units of the phonemes are well separated. Using too few clusters gives worse performance, because different units fall into the same cluster. In this study, however, 64 clusters had the worst performance due to bias.
c. There is no trend when the number of hidden states is changed. This shows that trials are needed to find the optimal setting.
For further research, a larger dataset is required so that the benefits of each feature extraction method can be seen more clearly.