On the Comparison of Line Spectral Frequencies and Mel-Frequency Cepstral Coefficients Using Feedforward Neural Network for Language Identification

ABSTRACT


INTRODUCTION
There are about 7,105 living languages spoken by the world's population of about 6.7 billion [1], and these languages definitely differ from each other. Much research has been conducted in the area of language identification (LID) systems. A tutorial on LID is presented in [2], in which syntactic, morphological, acoustic, phonetic, phonotactic, and prosodic level information is discussed in detail. Around 87 prosodic features were used for the LID system in [3], which provides better recognition performance, while [4] utilizes visual features with an error rate of less than 10%. In [5], a highly accurate and computationally efficient framework of i-vector representation is proposed for rapid language identification. A hierarchical LID framework is proposed in [6], in which a series of classification decisions is performed at multiple levels, with individual languages identified only at the final level.
Although much research has been conducted on LID, most studies identify only around two to three languages. Therefore, in this paper, five languages, namely Arabic, Chinese, English, Korean, and Malay, spoken by both males and females, will be analyzed. For LID systems, the most widely used features are Mel-Frequency Cepstral Coefficients (MFCC) and Line Spectral Frequencies (LSF) [7]-[9]. Systematic experiments will be conducted to find the optimum parameters. The combination of both LSF and MFCC features, along with various structures of feedforward neural networks, will be evaluated. The performance criteria are mainly the recognition rate, as well as the neural network training time.

LANGUAGE IDENTIFICATION SYSTEM
A language identification system contains at least three basic blocks: preprocessing, feature extraction, and a classifier. Preprocessing is a process of speech signal refinement. The raw speech signal obtained is not suitable for direct use as input: the weak signal has to be amplified, long silences removed, and background noise or music removed before further processing. Many feature extraction methods can be used for an LID system to characterize the speech signal of each speaker and language, for example Line Spectral Frequencies (LSF), Mel-Frequency Cepstral Coefficients (MFCC), Shifted Delta Cepstra (SDC), Perceptual Linear Prediction (PLP), Dynamic Time Warping (DTW), and Bark Frequency Cepstral Coefficients (BFCC). There are also several classifiers that can be used, including Vector Quantization (VQ), Gaussian Mixture Models (GMM), Support Vector Machines (SVM), ergodic Hidden Markov Models (HMM), the K-Means clustering algorithm, and Artificial Neural Networks (ANN) [2]. In this paper, the two most popular audio features, LSF and MFCC, will be evaluated, and a feedforward neural network will be used as the classifier, as shown in Figure 1.

Line Spectral Frequencies
A widely used source-filter model of speech is the linear prediction coefficient (LPC) model. LPC models are used for speech coding, recognition, and enhancement. An LPC model of order $p$ can be expressed as shown in Eq. (1):

$$s(n) = \sum_{k=1}^{p} a_k \, s(n-k) + e(n) \quad (1)$$

where $s(n)$ is the speech signal, $a_k$ are the LP parameters, and $e(n)$ is the speech excitation. Note that the coefficients $a_k$ model the correlation of each sample with the previous $p$ samples, whereas $e(n)$ models the part of speech that cannot be predicted from the past $p$ samples.
The line spectral frequencies (LSF) are an alternative representation of the linear prediction parameters. LSFs are used in speech coding, and in the interpolation and extrapolation of LP model parameters, because of their good interpolation and quantization properties. LSFs are derived from the roots of the two polynomials shown in Eq. (2) and (3):

$$P(z) = A(z) + z^{-(p+1)} A(z^{-1}) \quad (2)$$

$$Q(z) = A(z) - z^{-(p+1)} A(z^{-1}) \quad (3)$$

where $A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$ is the inverse linear predictor filter. For even $p$, the polynomial equations (Eq. (2) and (3)) can be rewritten in the factorized form shown in Eq. (4) and (5):

$$P(z) = (1 + z^{-1}) \prod_{k=1,3,\ldots,p-1} \left(1 - 2\cos\omega_k \, z^{-1} + z^{-2}\right) \quad (4)$$

$$Q(z) = (1 - z^{-1}) \prod_{k=2,4,\ldots,p} \left(1 - 2\cos\omega_k \, z^{-1} + z^{-2}\right) \quad (5)$$

where $\omega_1 < \omega_2 < \cdots < \omega_p$ are the LSF parameters. It can be shown that all the roots of the two polynomials have a magnitude of one, lie on the unit circle, and alternate with each other. Hence, in the LSF representation, the linear predictor coefficient vector $\mathbf{a}$ is converted to the LSF vector $\boldsymbol{\omega}$. The Matlab implementation functions lpc() and poly2lsf() were used for this purpose.
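The root-finding view of Eq. (2) and (3) is short enough to sketch directly. The paper used Matlab's poly2lsf(); the following is a minimal NumPy equivalent (the function name `poly2lsf_np` is ours, and a stable LPC polynomial is assumed). It forms the sum and difference polynomials, finds their roots, and keeps one angle per conjugate pair, discarding the trivial roots at $z = \pm 1$.

```python
import numpy as np

def poly2lsf_np(a):
    """Convert LPC coefficients a = [1, a1, ..., ap] (stable A(z)) to
    line spectral frequencies in radians, sorted and in (0, pi)."""
    a = np.asarray(a, dtype=float)
    ext = np.concatenate([a, [0.0]])
    P = ext + ext[::-1]   # sum polynomial, Eq. (2): palindromic
    Q = ext - ext[::-1]   # difference polynomial, Eq. (3): antipalindromic
    lsf = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # keep one root per conjugate pair; drop trivial roots at z = +1/-1
        lsf.extend(t for t in ang if 1e-4 < t < np.pi - 1e-4)
    return np.sort(np.array(lsf))
```

For an order-$p$ model this returns $p$ increasing angles, which is the LSF vector $\boldsymbol{\omega}$ described above.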

Mel-Frequency Cepstral Coefficients
Mel-Frequency Cepstral Coefficients (MFCC) are computed using a bank of $M$ filters $H_m(k)$, each with a triangular shape and spaced uniformly on the mel scale given by Eq. (6):

$$\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \quad (6)$$

Each filter is defined as in Eq. (7):

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\ 0, & k > f(m+1) \end{cases} \quad (7)$$

where $f(m)$ are the filter boundary points. The log-energy of the mel spectrum is calculated as:

$$S(m) = \ln\!\left[\sum_{k=0}^{N-1} |X(k)|^2 H_m(k)\right], \quad m = 1, \ldots, M \quad (8)$$

where $X(k)$ is the output of the discrete Fourier transform (DFT) of the input signal. Although the traditional cepstrum uses the inverse discrete Fourier transform (IDFT), MFCC is normally implemented using the discrete cosine transform as follows:

$$c(n) = \sum_{m=1}^{M} S(m) \cos\!\left(\frac{\pi n (m - 1/2)}{M}\right), \quad n = 0, 1, \ldots, C-1 \quad (9)$$

Typically, the number of filters $M$ ranges from 20 to 40, and the number of coefficients $C$ is 13.
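The pipeline of Eq. (6)-(9) can be sketched for a single windowed frame in NumPy. This is an illustrative sketch, not the paper's exact implementation: the FFT size, default filter count, and the small flooring constant inside the logarithm are our assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)       # Eq. (6)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)     # inverse of Eq. (6)

def mfcc_frame(frame, fs, n_filters=20, n_ceps=13, n_fft=512):
    """MFCCs of one windowed frame, following Eqs. (6)-(9)."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # |X(k)|^2
    # filter boundary points spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)  # rising slope, Eq. (7)
        fbank[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)  # falling slope
    log_energy = np.log(fbank @ spectrum + 1e-10)              # Eq. (8)
    # DCT of the log filterbank energies, Eq. (9)
    n = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * n * (m + 0.5) / n_filters)
    return dct @ log_energy
```

Applied frame by frame over the overlapping windows described later, this yields the MFCC feature vectors fed to the classifier.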

Feed Forward Neural Network Classifier
In an artificial neural network, the basic processing unit is a perceptron. A feedforward neural network organizes perceptrons into layers, cascades these layers into a network, and the connections between layers follow only one direction. The layer that receives the input feature vectors is the input layer, the outermost layer is the output layer which produces the classifier output, and the layers between them are called hidden layers. The computation of a feedforward neural network or multilayer perceptron can be described as follows:

$$\mathbf{y}^{(l)} = f^{(l)}\!\left(\mathbf{W}^{(l)} \mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}\right), \quad l = 1, \ldots, L \quad (10)$$

where $\mathbf{y}^{(l)}$ is the output vector of layer $l$ and $L$ is the number of layers in the neural network. $\mathbf{y}^{(0)} = \mathbf{x}$ is the input, while $\mathbf{W}^{(l)}$, $\mathbf{b}^{(l)}$, and $f^{(l)}$ are the weight matrix, the bias vector, and the activation function of layer $l$. In classification into $C$ classes, the activation function of the output layer is normally a sigmoid for $C = 2$ or a softmax function for $C > 2$. Given a set of samples and a feedforward neural network with initial parameters $\boldsymbol{\theta}$ (characterized by the weight matrices and bias vectors), we would like to train the neural network so that it learns the mapping. If we see the whole network as the function

$$F(\mathbf{x}) = \mathbf{y}^{(L)} \quad (11)$$

and define some loss function $E(\boldsymbol{\theta})$, then the goal of training the network becomes minimizing $E$. The gradient $\nabla E$ indicates the direction in which $E$ increases, where $n$ is the index of an arbitrary sample, $C$ is the number of classes, and $y_{nc}$ is the $c$-th component, corresponding to the probability of class $c$, of the output vector for sample $n$. The gradient components of the output layer can be computed directly, while they are harder to compute in lower layers. Normally, the gradient of a layer is calculated using the error propagated back from the layer above it. Since the errors are calculated in the reverse direction, this algorithm is known as backpropagation.
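The forward pass, backpropagation, and gradient-descent update just described can be illustrated with a minimal NumPy network: one sigmoid hidden layer and a softmax output trained with the cross-entropy loss. The toy two-dimensional data, layer sizes, learning rate, and iteration count are illustrative assumptions, not the paper's features or settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# toy data: three 2-D Gaussian blobs standing in for three classes
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [2, 0], [0, 2])])
T = np.zeros((150, 3))
T[np.arange(150), np.repeat([0, 1, 2], 50)] = 1    # one-hot targets t_nc

W1 = rng.normal(0, 0.5, (2, 20)); b1 = np.zeros(20)   # hidden layer, 20 nodes
W2 = rng.normal(0, 0.5, (20, 3)); b2 = np.zeros(3)    # softmax output layer
eta = 1.0                                             # learning rate

for _ in range(500):
    # forward pass (Eq. 10): sigmoid hidden layer, softmax output
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
    Y = softmax(H @ W2 + b2)
    # backpropagated error of the cross-entropy loss (output first, then lower layer)
    dZ2 = (Y - T) / len(X)
    dZ1 = (dZ2 @ W2.T) * H * (1.0 - H)
    # gradient-descent update: theta <- theta - eta * gradient
    W2 -= eta * (H.T @ dZ2); b2 -= eta * dZ2.sum(axis=0)
    W1 -= eta * (X.T @ dZ1); b1 -= eta * dZ1.sum(axis=0)

accuracy = (Y.argmax(axis=1) == T.argmax(axis=1)).mean()
```

Note how the output-layer error `dZ2` is computed directly, while the hidden-layer error `dZ1` is obtained from it in the reverse direction, which is exactly the backpropagation structure described above.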

RESULTS AND DISCUSSION
This section will discuss the language database preparation, experimental setup, various experiments to find optimum parameters, and the performance evaluation of the proposed LID system.

Experimental Setup and Language Database
A high-performance system was used for processing, i.e. a multicore system with an Intel Core i7-6700K 4.00 GHz CPU (4 cores with 8 threads), 32 GB RAM, a 256 GB SSD, and a 2 TB hard disk, running the Windows 10 operating system and Matlab 2017b with the Signal Processing and Neural Network Toolboxes. During simulation, other running applications were minimized as much as possible.
For the language database preparation, audio files of ten speakers of different languages were taken from an online language database. Six male and four female speakers were used as subjects for this project. The speakers were divided into two groups, for training (four males and one female) and testing (two males and three females), respectively. Each speaker spoke different sentences in one of five languages: Arabic, Chinese (specifically Mandarin), English, Korean, and Malay. The database presented in [10] was used with some rearrangement, in which 15 files were used for training and 5 files for testing.

Experiments on Sampling Frequencies, Frame Sizes, Model Orders, and Feedforward Neural Network Structures
There are many parameters which could be optimized to achieve the highest performance in terms of language recognition rate. In this paper, several important parameters will be analysed, including the sampling frequency, the frame size, the model order, and the structure of the feedforward neural network. The structure of the feedforward neural network can be varied in terms of the number of hidden layers and the number of nodes in each hidden layer. Note that a 50% overlapping window was used for both LSF and MFCC feature extraction, so that both have the same number of frames for each audio file. In [8], we used a non-overlapping window for LSF feature extraction.
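The 50% overlapping framing can be sketched as follows; the function name is illustrative, and the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames.
    With hop = frame_len // 2 this gives the 50% overlap used here."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
```

Because the hop size, and hence the frame count, is the same for both feature extractors, the LSF and MFCC feature matrices line up frame for frame, which is what later allows the two feature sets to be concatenated.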
Our previous studies have reported that the sampling frequency has an effect on the recognition rate in one setting [10], while it has a negligible effect in another [8]. Therefore, the first experiment varies the sampling frequency between 8000 Hz and 16000 Hz. For this experiment, the other two parameters, the frame size and the model order, were fixed, and the structure of the feedforward neural network was fixed to one hidden layer with 20 nodes. Table 1 shows the recognition rate versus training time for the two sampling frequencies, i.e. 8000 and 16000 Hz. Based on Table 1, the recognition rate for the 16 kHz sampling frequency is higher than for 8 kHz, especially for the LSF features. Therefore, the 16 kHz sampling frequency will be selected as one of the optimum parameters. The next experiment evaluates the effect of varying the window size (frame size) on the recognition rate. For this experiment, the sampling frequency and model order were fixed, and the structure of the feedforward neural network was again one hidden layer with 20 nodes.
Figure 2 shows the recognition rate and training time for frame sizes from 10 to 100 ms. The red line represents LSF, while the blue line represents MFCC; the square marker represents the recognition rate (left axis), while the triangle marker represents the training time (right axis). Based on Figure 2, a frame size of 30 ms was selected because it provides a reasonable training time and recognition rate. A frame size of 50 ms was another good candidate; however, a larger window tends not to capture the dynamics of the speech signal sufficiently. One can argue that the neural network training plays a more significant role in the recognition rate.

Figure 2. Recognition Rate for Various Frame Sizes
The subsequent experiment evaluates the effect of varying the model order of LPC and MFCC on the recognition rate. For this experiment, the sampling frequency and frame size were fixed at 16 kHz and 30 ms, and the structure of the feedforward neural network was fixed to one hidden layer with 20 nodes. Figure 3 shows the recognition rate and training time for model orders from 6 to 48 in steps of 2. Based on Figure 3, a model order of 42 was selected as one of the optimum parameters, as it provides a high recognition rate for both LSF and MFCC. Furthermore, the neural network training time is not strongly affected by increasing the model order.
The last experiment concerns the neural network structure configuration. Feedforward neural networks with various hidden layer structures were used. The number of epochs was set to 1000, the maximum number of validation failures was set to 100, and scaled conjugate gradient was used as the training algorithm. Table 2 shows the recognition rate and training time for various neural network structures, i.e. one hidden layer with various numbers of nodes, two hidden layers, and three hidden layers. The Matlab patternnet() function was used with varying hidden layer configurations. Note that our preliminary results using a learning vector quantization (LVQ) neural network, as in [8], were not as promising as those of a simple feedforward neural network with various hidden layer configurations. Moreover, LVQ requires a longer training time compared to the simple feedforward neural networks. From Table 2, it is found that the optimum structure is one hidden layer, with 1000 nodes, as highlighted in bold. This structure provides a high recognition rate with an acceptable training time. Another structure is a good candidate as well, but its training time is more than three times longer.

Experiments on the Optimum Parameters on the Training Data
From the previous experiments, the optimum parameters and neural network configuration are a sampling frequency of 16000 Hz, a frame size of 30 ms, a model order of 42, and a feedforward neural network with one hidden layer of 1000 nodes. The neural network was trained using 15 files for each of the 5 languages. Moreover, as the number of frames for LSF and MFCC is now the same, since both use a 50% overlapping window, we combined both features to evaluate whether the recognition rate improves. In this experiment, to allow a longer training time, we further set the number of epochs to 1000, the maximum number of validation failures to 1000, and reduced the minimum gradient. Figure 4 and Table 3 show the training performance for LSF, MFCC, and the combined features. Note that the input layer of the combined network is the sum of the LSF and MFCC input layers, i.e. 84 (42 + 42). It can be concluded that the combined features contribute to the recognition rate, while LSF is the dominant feature.

Experiments on the Testing Data
The last experiment evaluates the trained neural network on the unknown (testing) data, i.e. 5 files for each language. The feedforward neural network was trained to classify each frame into one of the 5 trained languages; in the end, however, we need to decide the identified language for the whole file, not just the current frame. For this purpose, we utilized the majority voting rule as explained in [11], in which the identified language is the majority vote over the frames of that particular file. Table 4 shows the recognition rate for each language per frame and per file after majority voting. Note that a feature set may have a lower recognition rate per frame yet achieve a 100% recognition rate when calculated per file using majority voting, and vice versa.
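The majority voting rule of [11] reduces to picking the most frequent frame-level label per file, which can be sketched in a couple of lines (the function name and example labels are illustrative).

```python
from collections import Counter

def identify_language(frame_labels):
    """File-level decision by majority voting: the identified language
    is the most frequent frame-level prediction."""
    return Counter(frame_labels).most_common(1)[0][0]

identify_language(['Malay', 'English', 'Malay', 'Malay', 'Korean'])  # 'Malay'
```

This is why a feature set with an imperfect per-frame rate can still score 100% per file: as long as the correct language wins the plurality of frames in every file, every file-level decision is correct.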
The detailed analysis revealed that the English language was wrongly classified as Malay for 1 file and 2 files using LSF and MFCC, respectively. The Malay language was wrongly classified for 1 file using the combined LSF and MFCC features. Interestingly, the combined features mostly improved the recognition rate, except for the Malay language. Further experiments with an additional database are required, especially for the English and Malay languages, to validate the obtained results. From the average recognition rate, it has been found that using LSF features alone is sufficient for language identification.

Figure 1. Proposed Language Identification System

Indonesian J Elec Eng & Comp Sci, Vol. 10, No. 1, April 2018: 168-175. ISSN: 2502-4752. (Teddy Surya Gunawan)

where

$$\nabla E = \left( \frac{\partial E}{\partial \mathbf{W}^{(1)}}, \frac{\partial E}{\partial \mathbf{b}^{(1)}}, \ldots, \frac{\partial E}{\partial \mathbf{W}^{(L)}}, \frac{\partial E}{\partial \mathbf{b}^{(L)}} \right) \quad (12)$$

Since the gradient specifies the direction in which $E$ increases, at each step the parameters $\boldsymbol{\theta}$ are updated proportionally to the negative of the gradient:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \, \nabla E(\boldsymbol{\theta}_t) \quad (13)$$

where $\eta > 0$. This training procedure is called gradient descent, and $\eta$ is a small positive training parameter called the learning rate. The cross-entropy error is normally used as the loss function:

$$E = -\sum_{n} \sum_{c=1}^{C} t_{nc} \ln y_{nc} \quad (14)$$

Figure 3. Recognition Rate for Various Model Orders

Figure 4. Neural Network Structure and its Performance for LSF, MFCC, and Combination of LSF and MFCC

Table 2. Experiments of Feedforward Neural Network Structures

Table 3. Performance for LSF, MFCC, and Combined Features

Table 4. Recognition Rate for Each Language on the Unknown/Testing Data