Fuzzy Recursive Least-Squares Approach in Speech System Identification: A Transformed Domain LPC Model

ABSTRACT


INTRODUCTION
In speech processing, the parameterization of an analog speech signal is an important step, as the resulting parameters should represent the salient spectral energies of the sound [1]. Because the linear predictive coding (LPC) model provides a good approximation to the vocal tract spectral envelope, and thereby a parsimonious representation of the vocal tract characteristics [2], it is the most common model used in speech spectral analysis. By moving the spectral analysis from the waveform domain to the spectrographic time-frequency domain, where information such as inter-formant energy fill can be portrayed clearly [3], the coefficients of the LPC model prove their contribution in speech signal synthesis [4]. In the most recent decade, the LPC model has been implemented in various applications, such as long-term recordings of electromyography signals [5], recognition of Malayalam vowels [6], clustering of microarray genetic data [7], reconstruction of missing electrocardiogram signals [8], dynamic texture segmentation of image sequences [9], and classification of human activity based on micro-Doppler signatures [10]. These applications show that the LPC model can be applied in a more general setting.
In speech processing, however, the accuracy of the LPC model often depends on the number and quality of the past speech samples that are fed into the model. Studies show that a reasonable number of past speech samples for approximating the current speech sample is determined by the sampling rate, plus two to five additional past speech samples [11]. Typically, the number of past speech samples, $p$, is equal to 10, where the two additional past speech samples account for the glottal flow and the radiation added during the pronunciation of speech [2]. If $p$ is as low as two, the synthetic speech is still intelligible but poor in quality [4]. This shows that, for high-quality synthetic speech, the number of past speech samples fed into the LPC model must be sufficiently large for the resulting coefficients to characterize the salient spectral energies of the sound.
From the formulation of the LPC model, past speech samples form a linear combination with the LPC coefficients to approximate the current speech sample. The resulting LPC coefficients therefore aim to minimize the prediction error at every time instant $n$. In the literature, fuzzy systems are often employed together with the LPC model to achieve certain objectives. For example, in a voice over internet protocol system, an evolving Takagi-Sugeno fuzzy model is used to recover the missing line spectral pairs that are calculated from the LPC model [12]; in cancer classification, a modified fuzzy c-means algorithm (fuzzy clustering) is used to classify the features that are extracted using the LPC model [13]; and in speech coding transmission, fuzzy clustering is used to cluster voiced segments, which are transmitted together with the extracted LPC coefficients [14]. Notice that in these applications the fuzzy system is only applied afterwards, and the LPC coefficients are still used.
In this paper, the fuzzy system is integrated directly into the LPC model using the recursive least-squares (RLS) approach. Instead of directly feeding the past speech samples into the LPC model and solving for the LPC coefficients, we transform the LPC model into the fuzzy domain and use fuzzy parameters to characterize the given speech samples. It has been proven that the fuzzy basis function [15] and its reduced form [16] are universal approximators. After a brief description of the formulation of the LPC model in Section 2, the transformed domain LPC model with fuzzy recursive least-squares approach (FRLS-LPC) is formulated in Section 3. In Section 4, based on the configuration of system identification, simulations are performed on real speech samples, and the performance of the FRLS-LPC model is evaluated in terms of the prediction error and the quality of the synthetic speech produced from the fuzzy parameters. Finally, conclusions are drawn in Section 5.

LINEAR PREDICTIVE CODING
Given a speech sample $s(n)$ at the current time instant $n$, the LPC model suggests that the current speech sample can be approximated as a linear combination of $p$ past speech samples such that
$$s(n) \approx \sum_{k=1}^{p} a_k \, s(n-k), \qquad (1)$$
where $a_k$ are the LPC coefficients for the 1st to $p$-th past speech samples, which are assumed to be constant over the speech analysis frame of $N$ speech samples. The objective function of the LPC model is given by
$$J = \sum_{n} e^2(n) = \sum_{n} \left( s(n) - \mathbf{a}^T \mathbf{s}(n-1) \right)^2, \qquad (2)$$
where $e(n)$ is the error of the approximated speech sample to the actual speech sample $s(n)$, $\mathbf{s}(n-1) = [s(n-1), \ldots, s(n-p)]^T$ is the column vector that consists of all the past speech samples, and $\mathbf{a} = [a_1, \ldots, a_p]^T$ is the column vector that consists of all the LPC coefficients. To minimize $J$, (2) is differentiated with respect to $\mathbf{a}$ and the result is equated to zero; thus the optimum LPC coefficients are obtained as
$$\mathbf{a} = \mathbf{R}^{-1} \mathbf{r}, \qquad (3)$$
where $\mathbf{R} = \sum_{n} \mathbf{s}(n-1)\mathbf{s}^T(n-1)$ is the covariance matrix and $\mathbf{r} = \sum_{n} \mathbf{s}(n-1)\, s(n)$ is the cross-correlation vector between the past speech samples and the current speech sample. Here, the LPC coefficients model the characteristics of the vocal tract, and they are processed further in applications of the LPC model (e.g. speech synthesis).
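As a rough illustration of the least-squares solution in (1) to (3), the following Python sketch estimates the LPC coefficients over one analysis frame. The function name `lpc_least_squares`, the use of NumPy, and the noise-free second-order test signal are assumptions made for this example; they are not taken from the simulations reported in this paper.

```python
import numpy as np

def lpc_least_squares(s, p):
    """Least-squares LPC coefficients a_1..a_p for one analysis frame."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    # The row for time n holds the past samples [s(n-1), ..., s(n-p)].
    X = np.column_stack([s[p - k:N - k] for k in range(1, p + 1)])
    y = s[p:]                      # current samples s(n), n = p, ..., N-1
    R = X.T @ X                    # covariance matrix of the past samples
    r = X.T @ y                    # cross-correlation with the current sample
    return np.linalg.solve(R, r)   # a = R^{-1} r

# Hypothetical test signal: a noise-free second-order recursion.
true_a = np.array([1.5, -0.7])
s = np.zeros(60)
s[0], s[1] = 1.0, 1.0
for n in range(2, len(s)):
    s[n] = true_a[0] * s[n - 1] + true_a[1] * s[n - 2]

a_hat = lpc_least_squares(s, p=2)
print(a_hat)  # should be close to true_a
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse of the covariance matrix explicitly, which is usually the more stable choice numerically.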

TRANSFORMED DOMAIN LPC MODEL: FUZZY RECURSIVE LEAST-SQUARES APPROACH
From (1), the LPC model approximates the current speech sample with a linear combination of past speech samples. Instead of being fed directly into the LPC model, these past speech samples are transformed into fuzzy inputs, where each past speech sample is an element of the universe of discourse and has a degree of membership in a particular fuzzy set. In the transformed domain of the fuzzy system, these fuzzy inputs form a linear combination with fuzzy parameters to approximate the current speech sample. Redefined as the transformed domain LPC model, the current speech sample is approximated by (4), a linear combination of the fuzzy parameters and the fuzzy inputs, where the fuzzy inputs are the fuzzified past speech samples corresponding to each fuzzy rule. Employing the reduced fuzzy basis function [16] as the fuzzy inputs, (5) expresses each fuzzy input through the degrees of membership of the past speech samples in the triangular-shaped membership functions of the corresponding fuzzy rule. For notational simplicity, (5) is denoted by a single symbol, so that (4) can be rewritten as (6), in which the current speech sample is approximated by the linear combination of fuzzified past speech samples and fuzzy parameters.
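To make the fuzzification step more concrete, the Python sketch below turns two past speech samples into fuzzy inputs. It rests on two assumptions that are not confirmed by the text above: the triangular-shaped membership functions overlap so that the degrees of membership sum to one, and each fuzzy rule multiplies one membership degree per past sample. The exact form of the reduced fuzzy basis function is the one given in [16]; the names `triangular_memberships`, `fuzzy_inputs` and `centers` are hypothetical.

```python
from itertools import product
import numpy as np

def triangular_memberships(x, centers):
    """Membership degrees of x in triangles peaked at the sorted `centers`."""
    centers = np.asarray(centers, dtype=float)
    peaks = np.eye(len(centers))
    # Linear interpolation of a unit peak gives a triangle; neighbouring
    # triangles overlap so that the degrees sum to one (an assumption).
    return np.array([np.interp(x, centers, peaks[j])
                     for j in range(len(centers))])

def fuzzy_inputs(past_samples, centers):
    """One fuzzy input per rule: product of memberships over the past samples."""
    mus = [triangular_memberships(x, centers) for x in past_samples]
    rules = product(range(len(centers)), repeat=len(past_samples))
    return np.array([np.prod([mus[k][j] for k, j in enumerate(rule)])
                     for rule in rules])

# Two past samples, three membership functions each (assumed range [-1, 1]).
centers = [-1.0, 0.0, 1.0]
phi = fuzzy_inputs([0.2, -0.5], centers)
print(phi.shape, phi.sum())
```

With this rule grid, two past samples and three membership functions give nine fuzzy inputs, so the number of fuzzy parameters grows with the number of membership functions rather than with the number of past samples.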
Let $\boldsymbol{\varphi}(n)$ be the column vector that consists of all fuzzified past speech samples with respect to each fuzzy rule, and $\boldsymbol{\theta}(n)$ be the column vector that consists of all fuzzy parameters. By introducing an exponential forgetting factor $\lambda$, with $0 < \lambda \le 1$, so that information from the distant past has a lesser effect on the coefficient update, the objective function is given by
$$J(n) = \sum_{i=1}^{n} \lambda^{n-i} e^2(i), \qquad (7)$$
where $e(i) = s(i) - \boldsymbol{\theta}^T(n)\boldsymbol{\varphi}(i)$ is here the error of the approximated speech sample in the transformed domain of the fuzzy system to the actual speech sample $s(i)$. Following the recursive least-squares (RLS) approach to solve for the fuzzy parameters, the optimum fuzzy parameters are obtained in a sequential recursive format such that
$$\boldsymbol{\theta}(n) = \boldsymbol{\theta}(n-1) + \mathbf{k}(n)\,\xi(n), \qquad (8)$$
where $\mathbf{k}(n) = \boldsymbol{\Phi}^{-1}(n)\boldsymbol{\varphi}(n)$ is the gain vector, in which $\boldsymbol{\Phi}(n) = \sum_{i=1}^{n} \lambda^{n-i} \boldsymbol{\varphi}(i)\boldsymbol{\varphi}^T(i)$ is the correlation matrix of fuzzy inputs, and $\xi(n) = s(n) - \boldsymbol{\theta}^T(n-1)\boldsymbol{\varphi}(n)$ is the a priori error. Here, the LPC model is transformed into the fuzzy domain, and the LPC coefficients are replaced by the fuzzy parameters. Instead of the LPC coefficients, these fuzzy parameters can be used for further processing in applications of the LPC model. The formulations from (4) to (8) define this transformed domain LPC model with an integrated fuzzy system, which is referred to as the LPC model with fuzzy recursive least-squares approach (FRLS-LPC model).
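The recursion in (8) can be sketched in Python as a standard exponentially weighted RLS update. In this sketch, `phi` stands for the vector of fuzzy inputs at one time instant and `P` for the inverse of the correlation matrix of fuzzy inputs; the function name `rls_step`, the initial value of `P`, the forgetting factor of 0.99 and the random test data are assumptions made for illustration only.

```python
import numpy as np

def rls_step(theta, P, phi, s, lam=0.99):
    """One exponentially weighted RLS update of the parameter vector."""
    xi = s - theta @ phi                       # a priori error
    k = P @ phi / (lam + phi @ P @ phi)        # gain vector
    theta = theta + k * xi                     # parameter update as in (8)
    P = (P - np.outer(k, phi @ P)) / lam       # inverse correlation matrix update
    return theta, P, xi

# Hypothetical data: three inputs and a fixed, known parameter vector.
rng = np.random.default_rng(0)
true_theta = np.array([0.5, -0.2, 0.1])
theta = np.zeros(3)
P = 1e3 * np.eye(3)                            # large initial P (assumed)
for _ in range(300):
    phi = rng.random(3)                        # stand-in for the fuzzy inputs
    s = true_theta @ phi                       # noise-free target sample
    theta, P, xi = rls_step(theta, P, phi, s)

print(theta)  # should approach true_theta
```

Updating the inverse correlation matrix directly avoids a matrix inversion at every time instant, which is the usual reason for preferring the recursive form over solving the normal equations afresh.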

SIMULATION
In this section, the LPC model and the FRLS-LPC model are evaluated in terms of prediction error and the quality of the synthetic speech produced from the extracted coefficients. Based on the configuration of system identification, the coefficients of the model can be identified using the RLS approach. As the name linear predictive suggests, the extracted coefficients of the model form a linear combination with past speech samples to approximate the current speech sample at a given time instant $n$.

In this simulation, a speech recording of the word Malaysia is used. The pronunciation of Malaysia is /ma'-lei-zi-a/, and the recording contains 6000 speech samples. Figure 1 shows the waveform and spectrogram of Malaysia.

Measurement Criteria
There are two measurement criteria to evaluate the performance of the model: (i) the squared error in decibel scale, $e^2_{\mathrm{dB}}(n)$, and (ii) the spectrogram of the synthetic speech. Using the extracted coefficients and the errors produced by (8), the synthetic speech is constructed and compared with the original speech to produce $e^2_{\mathrm{dB}}(n)$ and the spectrogram. The first measurement criterion, the prediction error, is given by
$$e^2_{\mathrm{dB}}(n) = 10 \log_{10}\!\left[ \left( s(n) - \tilde{s}(n) \right)^2 \right], \qquad (9)$$
where $\log_{10}$ is the base 10 logarithm, $s(n)$ is the original speech sample, and $\tilde{s}(n)$ is the approximated speech sample at time instant $n$ using past speech samples (i.e. their linear combination with the LPC coefficients for the LPC model, and the linear combination of fuzzy inputs with the fuzzy parameters for the FRLS-LPC model). This measures the waveform difference between the original speech and the synthetic speech. Besides (9), the spectrogram of the synthetic speech is compared with the spectrogram of the original speech in order to illustrate how well the extracted coefficients represent the energy fill of the speech samples. Figure 2 shows the partitioned spectrogram of Malaysia. Partition b comprises the semivowel /l/ and two front vowels, /e/ and /i/. Owing to its transitional nature, /l/ is often influenced by the vowel that follows; thus the energy fill is concentrated at high and low frequencies, as is characteristic of both front vowels. For partition c, the energy fill is concentrated at high frequency due to the voiced fricative /z/ and the front vowel /i/, which produce excitation and high frequency resonances. For partition d, the energy fill is clearly concentrated at middle and low frequencies due to the mid vowel /a/. Finally, hardly any energy fill can be seen at partition e, as it is the end of the speech. For further spectrogram analysis, readers are referred to [2] for more detailed illustrations and explanations.
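The squared error in decibel scale of (9) can be computed per sample as in the following Python sketch. The function name `squared_error_db` and the small `floor` value, which only guards against taking the logarithm of zero when the two samples coincide exactly, are assumptions added for this example.

```python
import numpy as np

def squared_error_db(s, s_tilde, floor=1e-300):
    """Per-sample squared error between original and synthetic speech, in dB."""
    s = np.asarray(s, dtype=float)
    s_tilde = np.asarray(s_tilde, dtype=float)
    sq_err = (s - s_tilde) ** 2
    # The floor is an assumption that keeps log10 finite for a zero error.
    return 10.0 * np.log10(np.maximum(sq_err, floor))

# An error of 0.1 in amplitude corresponds to about -20 dB.
err_db = squared_error_db([1.0, 0.5], [0.9, 0.5])
print(err_db)
```

Plotting this quantity against the time instant gives a learning curve of the kind used to compare the models.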

Limited Past Speech Samples
In this simulation, the LPC model is tested with 10 past speech samples (as suggested in [2]), and also with only 2 past speech samples (the minimum number for the synthetic speech to be intelligible [4]). For the FRLS-LPC model, only 2 past speech samples are used, and each of them is fuzzified with three triangular-shaped membership functions. The parameters of the triangular-shaped membership functions are chosen based on the range of the past speech samples so as to obtain the reduced fuzzy basis function [16]. At every iteration (i.e. at each time instant $n$), only a limited number of past speech samples is fed into the model (i.e. only $s(n-1)$ to $s(n-p)$ are used to approximate $s(n)$). Thus, the two LPC models use 10 and 2 past speech samples, while the FRLS-LPC model uses 2. Figure 3 shows the waveform, learning curve of error and spectrogram of the synthetic speech. The squared errors of both LPC models are at around -30 dB, while for the FRLS-LPC model, although only 2 past speech samples are fed into the model, the squared error is at around -350 dB, far lower than for both LPC models. In terms of the waveform, the synthetic speech of the FRLS-LPC model resembles the original speech waveform more closely than those of the LPC models.
In terms of the spectrogram analysis, the energy fill of the FRLS-LPC model is similar to that of the original speech where it is yellow for the LPC model with 10 past speech samples. Besides, at the upper right corner of partition e, the energy fill is dark blue for the FRLS-LPC model (which is similar to the original speech), while it is light blue for the LPC model with 10 past speech samples. The spectrogram analysis shows that the distribution of the synthetic speech energy fill of the FRLS-LPC model is clearer and more detailed than that of the LPC model with 10 past speech samples (not to mention the LPC model with just 2 past speech samples).

Corrupted Past Speech Samples
To push the test further, white Gaussian noise is added to the past speech samples so that the signal-to-noise ratio (SNR) drops to 10 dB. These corrupted past speech samples are fed into the model to test its ability to extract the coefficients in the presence of measurement noise. At such a low SNR, the corrupted speech is a noisy sound rather than a proper pronunciation of Malaysia. If the SNR is higher, the pronunciation of Malaysia can be heard, together with some background noise.
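One common way to corrupt a signal to a target SNR is sketched below in Python. It scales white Gaussian noise so that the ratio of average signal power to average noise power matches the requested value. The function name `add_white_noise`, the sinusoidal stand-in signal and the fixed random seed are assumptions for illustration; the procedure actually used in this simulation may differ in detail.

```python
import numpy as np

def add_white_noise(s, snr_db, rng=None):
    """Return s plus white Gaussian noise scaled to the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(s, dtype=float)
    signal_power = np.mean(s ** 2)
    # SNR_dB = 10 log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.sqrt(noise_power) * rng.standard_normal(s.shape)
    return s + noise

# Hypothetical stand-in for the speech samples.
clean = np.sin(np.linspace(0.0, 100.0, 20000))
noisy = add_white_noise(clean, snr_db=10.0, rng=np.random.default_rng(0))

measured = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(measured)  # should be close to the requested 10 dB
```

Because the noise is random, the measured SNR of a finite recording only approximates the requested value, with the approximation improving as the number of samples grows.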
For the LPC models, 2 and 10 corrupted past speech samples are fed into the model, while for the FRLS-LPC models, only 2 corrupted past speech samples are fed into the model. Using the extracted coefficients, the synthetic speech here is the speech reconstructed from noise to resemble the original speech. Figure 4 shows the waveform, learning curve of error and spectrogram of the synthetic speech using corrupted past speech samples with only 10 dB SNR. From Figure 4, even when limited to 2 such corrupted (10 dB SNR) past speech samples, the FRLS-LPC models outperform both LPC models in estimating the current speech samples. For both LPC models, the synthetic speech fails to reconstruct the original speech from corrupted past speech samples of only 10 dB SNR. Although the squared errors of both LPC models are at around -30 dB, the waveforms are totally unrecognizable. From the spectrogram analysis, the synthetic speech of both LPC models is a noisy sound rather than proper speech. Although the LPC model with 10 past speech samples is able to reconstruct some of the low frequency energy fill, the overall energy fill is far different from that of the original speech. For both LPC models, this simulation is also conducted with corrupted past speech samples of 30 dB SNR: the synthetic speech becomes better, and the pronunciation of Malaysia can be heard with some background noise.
For both FRLS-LPC models, the squared error is at around -40 dB for three membership functions and at around -60 dB for six membership functions. The waveforms of the synthetic speech from the FRLS-LPC models closely match the waveform of the original speech, especially the one with the higher number of membership functions. From the spectrogram analysis, the distribution of the synthetic speech energy fill is clearer and comparable to that of the original speech, although it is rough in shape.
With a higher number of membership functions to fuzzify the corrupted past speech samples, the quality of the synthetic speech improves. This can be seen in the spectrogram of the synthetic speech from the FRLS-LPC model with six membership functions: the low frequency energy fill at partition a, the low and high frequency energy fill at partition b, the high frequency energy fill at partition c, and the low and middle frequency energy fill at partition d all match the original speech. Played as audio, the pronunciation of Malaysia can be heard clearly. In this simulation, the FRLS-LPC model clearly outperforms the LPC model. Even with only 2 corrupted past speech samples of 10 dB SNR, the FRLS-LPC model is able to reconstruct the speech such that it closely resembles the original speech, whereas the LPC models fail to do so.

CONCLUSION
In this paper, a fuzzy system is integrated directly into the LPC model using the recursive least-squares approach to create the FRLS-LPC model. In this transformed domain LPC model, fuzzy parameters are used to approximate the current speech sample, and the performance depends on the fuzzy rules and membership functions rather than on the number of past speech samples. Although the computation of fuzzy inputs incurs additional computational cost compared to the LPC model, the results are significant enough to justify the trade-off. Simulation shows that, even with a limited number of past speech samples, the synthetic speech obtained by the FRLS-LPC model is far better than that of the LPC model with a sufficient number of past speech samples, in terms of both prediction error and spectrogram analysis. The simulation is also run with corrupted past speech samples, and the results show that, even when such low-SNR samples are fed into the model, the FRLS-LPC model produces synthetic speech that resembles the original speech. Both simulations show the viability of the FRLS-LPC model under such constrained conditions, where the LPC model underperforms. Since the performance of the FRLS-LPC model depends on the fuzzy rules rather than on the number of past speech samples, the fuzzy parameters extracted using the FRLS-LPC model can be an alternative to the LPC coefficients in applications of the LPC model.