Lipreading Recognition Based on SVM and DTAK

To improve the recognition accuracy of isolated-word lipreading when only a small number of training samples is available, this paper introduces the support vector machine (SVM) as the classifier. Because SVM is based on structural risk minimization, it is well suited to pattern recognition with small samples, and it avoids the unrealistic assumptions made by traditional classifiers. To meet the SVM requirement of a fixed input feature dimensionality, several feature dimensionality normalization methods are discussed and compared, including the 3-4-3 data segmentation method, an HMM-based method and a DTAK (Dynamic Time Alignment Kernel) based method. Two experiments were performed on a bimodal database. In the first experiment, the different input feature normalization algorithms were compared using SVM, and DTAK-based normalization achieved the best result. In the second experiment, SVM was compared with HMM for different numbers of training samples. The results show that SVM performs better than HMM when training samples are few.


I. INTRODUCTION
Lipreading is usually used to improve speech recognition accuracy in noisy environments, and classification models used in speech recognition are often applied to lipreading directly, including DTW (Dynamic Time Warping), HMM (Hidden Markov Model), ANN (Artificial Neural Network) and TDNN (Time Delay Neural Network) [1]. HMM is the most popular classification model in speech recognition because of its excellent ability to handle time-varying feature sequences. However, HMM rests on some unrealistic assumptions: for example, the observations are treated as independent of one another, and the training criterion, which is based on maximum likelihood, is not optimal. Although ANN does not make these assumptions, it is based on Empirical Risk Minimization (ERM), so its solution may be only a local optimum, and it is prone to over-fitting. In comparison, the Support Vector Machine (SVM) is based on Structural Risk Minimization (SRM) [2], which makes a compromise between model complexity and learning capability, so in theory SVM is optimal for classification. In general, it has the following merits. First, since the number of samples is usually limited, SVM can obtain the optimal solution for the available samples based on SRM theory, whereas traditional learning theory assumes that the number of samples tends to infinity. Second, in theory, SVM training reaches a global optimum, unlike traditional classifiers. Finally, by constructing a linear decision function in a high-dimensional space, SVM can solve problems that are not linearly separable in the original low-dimensional space. Because lipreading experiments are all based on databases, the number of samples is often limited and inadequate for training HMM parameters. SVM is therefore introduced as the classification model for lipreading.
Section 2 analyzes SVM theory, and Section 3 discusses several feature dimension normalization methods. Section 4 presents the lipreading recognition experiments, including the multi-class classifier training method and a comparison between SVM and HMM. Section 5 concludes the paper with a summary and a brief discussion of existing problems and future work.

A. Rationale of SVM Classification
An SVM is a trained two-class classifier: it decides which of two categories an input feature vector $X$ belongs to. The decision function is given by equation (1):
$$f(X) = \operatorname{sign}\bigl(W \cdot \phi(X) + b\bigr) \qquad (1)$$
where $\phi(x)$ is a non-linear function that projects $X$ into a new high-dimensional feature space in which linear separation becomes possible, $W$ denotes the weight (normal) vector of the hyperplane in the high-dimensional space, and $b$ is the threshold value. SVM is notable for its learning capability. Traditional classifiers aim at ERM: if a classifier trained under ERM is too simple, its generalization is poor, and if it is too complex, it over-fits. The remedy is SRM, which SVM implements by solving a constrained optimization problem whose solution is the optimal separating hyperplane. According to equation (6), the $x_i$ corresponding to nonzero $\alpha_i$ are called support vectors. In general, $\phi(x)$ does not need to be known in advance, because only the kernel (the inner product in feature space) is required. For non-linearly separable cases, the Radial Basis Function (RBF) kernel is used in this paper in the final decision function.
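For reference, the standard soft-margin SVM formulation can be written with the symbols used above ($W$, $b$, $\phi$, $\alpha_i$) as sketched below. The slack variables $\xi_i$, the penalty $C$, the labels $y_i \in \{-1, +1\}$, the sample count $n$ and the RBF width $\gamma$ are notation assumed here from the textbook formulation, and the grouping of the expressions does not claim to reproduce this paper's equation numbering.

```latex
\begin{aligned}
&\min_{W,\,b,\,\xi}\ \tfrac{1}{2}\lVert W\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i\bigl(W\cdot\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0,\\
&W=\sum_{i=1}^{n}\alpha_i\,y_i\,\phi(x_i),\\
&K(x_i,x)=\phi(x_i)\cdot\phi(x)=\exp\bigl(-\gamma\lVert x_i-x\rVert^{2}\bigr),\\
&f(x)=\operatorname{sign}\Bigl(\sum_{i=1}^{n}\alpha_i\,y_i\,K(x_i,x)+b\Bigr).
\end{aligned}
```

Only the samples with $\alpha_i \neq 0$ contribute to the last sum, which is why they are singled out as support vectors.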

B. Rationale of the Multi-class Classifier
The SVM principle described above applies only to two classes. For multi-class classification, two techniques are common in practice. The first is "one-versus-all" [3]: k classifiers are trained for a k-class problem, and an unknown sample is assigned to the class whose classifier separates it with the greatest margin.
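The "one-versus-all" decision rule can be sketched as follows. This is a minimal illustration, assuming the k binary classifiers have already been trained and that their signed decision values are available as a matrix; the function name and the toy scores are hypothetical.

```python
import numpy as np

def one_versus_all_predict(decision_values):
    """Pick, for each sample, the class whose binary SVM gives the largest margin.

    decision_values: array of shape (n_samples, k); column c holds the signed
    decision value of the hypothetical "class c versus the rest" classifier.
    """
    decision_values = np.asarray(decision_values, dtype=float)
    return np.argmax(decision_values, axis=1)

# Hypothetical decision values for 2 test samples and k = 3 classes.
scores = np.array([[-0.4, 1.3, 0.2],
                   [0.9, -1.1, -0.3]])
predicted = one_versus_all_predict(scores)
```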
The second is "one-versus-one" [4]. In this case, k(k-1)/2 SVMs have to be created for k classes, and an unknown sample is assigned to the class selected by the most classifiers. Although this involves building k(k-1)/2 classifiers, training may actually take less time than with "one-versus-all", since the training set for each classifier is much smaller. This method is therefore more popular when the data set is large.
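The "one-versus-one" voting scheme can be sketched in the same spirit. The pairwise classifiers are represented by a hypothetical callable rather than trained SVMs, and the tie-breaking rule (smallest class index) is an assumption of this sketch, not something stated in the paper.

```python
from collections import Counter
from itertools import combinations

def one_versus_one_predict(pairwise_winner, k):
    """Majority vote over all k(k-1)/2 pairwise classifiers.

    pairwise_winner: callable (i, j) -> i or j, standing in for a trained
    binary SVM that separates class i from class j (hypothetical here).
    Returns the winning class and the number of classifiers consulted.
    """
    pairs = list(combinations(range(k), 2))
    votes = Counter(pairwise_winner(i, j) for i, j in pairs)
    # Ties are broken in favour of the smallest class index (an assumption).
    best = max(votes.values())
    return min(c for c, v in votes.items() if v == best), len(pairs)

def toy_winner(i, j):
    """Toy stand-in: class 2 wins every pair it is in, else the lower index wins."""
    return 2 if 2 in (i, j) else min(i, j)

label, n_classifiers = one_versus_one_predict(toy_winner, k=10)
```

For the 10-word vocabulary used later in the paper, this scheme needs 45 pairwise classifiers, each trained on the samples of only two words.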

A. Feature Dimension Normalization Based on DTAK
Standard SVMs expect a fixed-length feature vector as input, but speech is dynamic, which leads to variable-length features. Different approaches have been presented to deal with the variable duration of acoustic speech units. In general, there are two kinds of method: one normalizes the time dimension of the feature vector to fit the SVM input; the other normalizes the kernel so that the SVM adapts to a variable input feature dimension. Many methods of the first kind exist. In [5], two methods of sequence resampling are assessed. One is the variable window size method, which includes the whole digit utterance in a given number of windows per digit by adjusting the window size to the digit duration. The other is the fixed window size method, which keeps the window size at a fixed number of analysis instants regardless of how much of the word it covers, so some information is lost for long utterances. In [6], the feature sequence is linearly stretched or compressed in time. In [7], the input feature sequence is divided according to a 3-4-3 ratio; this method neglects the segment information that differs from word to word. In [8], an HMM is introduced to segment the data using the Viterbi algorithm.
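As an illustration of the first kind of method, the 3-4-3 segmentation can be sketched as below. The paper only states that the sequence is divided in a 3-4-3 ratio; averaging the frames inside each segment and concatenating the three means is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def segment_343(features):
    """Map a (T, d) feature sequence to a fixed-length vector of size 3*d.

    The T frames are split into three consecutive segments whose lengths are
    roughly in the ratio 3:4:3; each segment is averaged and the three means
    are concatenated. Averaging within a segment is an assumption.
    """
    features = np.asarray(features, dtype=float)
    T = features.shape[0]
    if T < 3:
        raise ValueError("need at least 3 frames")
    # Boundaries at 30% and 70% of the sequence, forced to leave no segment empty.
    b1 = min(max(1, (3 * T) // 10), T - 2)
    b2 = min(max(b1 + 1, (7 * T) // 10), T - 1)
    parts = (features[:b1], features[b1:b2], features[b2:])
    return np.concatenate([p.mean(axis=0) for p in parts])

short = segment_343(np.arange(10 * 2).reshape(10, 2))  # 10 frames, d = 2
long_ = segment_343(np.arange(25 * 2).reshape(25, 2))  # 25 frames, d = 2
```

Both sequences end up as 6-dimensional vectors regardless of their duration, which is what the SVM input requires. The fixed ratio ignores where the real segment boundaries of each word lie.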
The methods mentioned above all rest on the concept that speech is composed of a number of isolated states; they differ only in how they segment the data. The kernel approach is different. SVMs are examples of the more general class of kernel methods; that is, they rely on a kernel to obtain a non-linear decision function. The kernel defines the space in which the solution is sought, and its choice is therefore problem-dependent. The kernel function is the key feature of SVM: the decision function can be obtained from the kernel (the inner product in feature space) alone. Shimodaira first proposed embedding dynamic time alignment into the kernel function [9]; this kernel is called DTAK. With it, lipreading features of variable dimensionality can be input to the SVM directly, and the standard SVM training and recognition algorithms can be used.
Traditional DTW looks for the shortest path (alignment function) between the test feature and the stored template by stretching or shortening the time axis. In contrast, DTAK looks for the optimal path in terms of the accumulated similarity between the test feature and the template.
Assume that X and Y are two features of different dimensionality to be compared, and that $\psi_I(k)$ and $\psi_J(k)$ are the time-alignment functions at instant k on the two time axes, respectively. An inner product or a kernel function can be used as a measure of similarity between X and Y.
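One way to write this similarity measure, following the time-alignment formulation of [9], is sketched below. The path length $L$, the normalizing factor $M_{\psi}$ and the sequence lengths $|X|$ and $|Y|$ are notation assumed here; the subscripts $I$ and $J$ distinguish the alignment functions on the two time axes.

```latex
\begin{aligned}
&K_s(X,Y)=\max_{\psi_I,\,\psi_J}\ \frac{1}{M_{\psi}}\sum_{k=1}^{L} m(k)\; x_{\psi_I(k)}\cdot y_{\psi_J(k)},
\qquad M_{\psi}=\sum_{k=1}^{L} m(k),\\
&1\le\psi_I(k)\le\psi_I(k+1)\le |X|,\qquad \psi_I(k+1)-\psi_I(k)\le Q,\\
&1\le\psi_J(k)\le\psi_J(k+1)\le |Y|,\qquad \psi_J(k+1)-\psi_J(k)\le Q.
\end{aligned}
```

Replacing the frame-level inner product $x_{\psi_I(k)}\cdot y_{\psi_J(k)}$ by a kernel such as the RBF turns this measure into a non-linear sequence kernel.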
The alignment functions are subject to monotonicity and continuity constraints. Here $m(k)$ is a non-negative scale factor that weights the different paths, and $Q$ is a constant that limits the local continuity.
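The alignment search can be carried out by dynamic programming. The following is a minimal sketch, assuming the three-direction step pattern of [9] with path weights 1, 2 and 1, normalization by the summed sequence lengths, and an RBF frame kernel; the continuity constant Q is not enforced here, and the names `rbf`, `dtak` and the default `gamma` are hypothetical.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """RBF kernel between two frame vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def dtak(X, Y, frame_kernel=rbf):
    """Dynamic time-alignment kernel between sequences X (n, d) and Y (m, d).

    G[i, j] is the best accumulated similarity of an alignment ending at
    frame i of X and frame j of Y; diagonal steps are weighted by 2.
    """
    n, m = len(X), len(Y)
    G = np.full((n + 1, m + 1), -np.inf)
    G[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = frame_kernel(X[i - 1], Y[j - 1])
            G[i, j] = max(G[i - 1, j] + k,
                          G[i - 1, j - 1] + 2.0 * k,
                          G[i, j - 1] + k)
    return G[n, m] / (n + m)  # normalise by the total path weight

a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])  # a with its middle frame repeated
similarity = dtak(a, b)
```

Because every admissible path has total weight n + m, the value is bounded by 1 with an RBF frame kernel, and a time-stretched copy such as `b` reaches that bound. The resulting matrix of pairwise values could be passed to an SVM implementation that accepts precomputed kernels.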
This optimization problem can be solved by dynamic programming, with the following recursion:
$$G(i,j) = \max \begin{cases} G(i-1,j) + x_i \cdot y_j \\ G(i-1,j-1) + 2\, x_i \cdot y_j \\ G(i,j-1) + x_i \cdot y_j \end{cases}$$
where $G(i,j)$ is the accumulated similarity up to frame $i$ of X and frame $j$ of Y. The inner product between X and Y can then be expressed as equation (10). The above inner product analysis is performed on features of different dimensionality. If the inner product is replaced by a non-linear function, such as the RBF kernel, and embedded into the SVM, the DTAK function is obtained, and the decision function can be expressed in terms of it as equation (12).

B. Feature Value Scaling
The LIBSVM tools are used in this paper [4]. Feature values are scaled before training, because features with small values would otherwise be swamped by features with a large range.
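Feature scaling of this kind can be sketched as a per-feature linear mapping. The target range [-1, 1] and the use of training-set minima and maxima are assumptions of this sketch (the paper does not state the range), and the helper names are hypothetical.

```python
import numpy as np

def fit_scaler(train, lower=-1.0, upper=1.0):
    """Record the per-feature minimum and maximum of the training data."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0), lower, upper

def apply_scaler(data, scaler):
    """Linearly map each feature to [lower, upper] using the training range."""
    lo, hi, lower, upper = scaler
    span = np.where(hi > lo, hi - lo, 1.0)  # constant features map to `lower`
    return lower + (upper - lower) * (np.asarray(data, dtype=float) - lo) / span

# Two features with very different ranges (toy values).
train = np.array([[0.001, 200.0], [0.003, 400.0], [0.002, 300.0]])
scaler = fit_scaler(train)
scaled = apply_scaler(train, scaler)
```

The same `scaler` fitted on the training data should be applied to the test data, so that both sets are mapped consistently.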

IV. EXPERIMENTAL RESULTS AND ANALYSIS
To test SVM in lipreading, we built a bimodal database containing 10 Chinese words, each repeated 40 times. The experiments used a "leave-one-out" training method: only one sample of each word was tested, and the remaining samples were used for training. The experimental results show that the DTAK normalization method achieves the best recognition accuracy, whereas the 3-4-3 method is inferior to the other two methods. Apparently, segmenting the data according to a 3-4-3 ratio is not well justified: the resulting features may not accurately reflect the real segment information in speech.
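The hold-out protocol described at the start of this section can be sketched as follows. The sketch assumes that the held-out repetition rotates over all 40 repetitions, which the paper does not state explicitly; the function name and the index-pair representation are hypothetical.

```python
def leave_one_out_splits(n_words=10, n_repetitions=40):
    """Yield (train, test) lists of (word, repetition) index pairs.

    In round r, repetition r of every word is held out for testing and the
    remaining repetitions of every word are used for training.
    """
    for r in range(n_repetitions):
        test = [(w, r) for w in range(n_words)]
        train = [(w, s) for w in range(n_words)
                 for s in range(n_repetitions) if s != r]
        yield train, test

splits = list(leave_one_out_splits())
```

With 10 words and 40 repetitions, each round tests 10 samples and trains on the other 390.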

B. Comparison between SVM and HMM
To compare the classification accuracy of SVM+DTAK with that of a traditional classifier, an HMM was used as the baseline. The HMM comprises 6 states, each with 1 Gaussian mixture component. The results are depicted in Figure 2. When the training samples are few, for example only 5 samples of each word, the SVM recognition accuracy is 84%, whereas the HMM recognition accuracy is only 75%. This reflects the learning and generalization capability of SVM with small samples. When the number of training samples is increased to 150, however, HMM becomes better than SVM, and as the number of training samples increases further, both HMM and SVM approach 90% accuracy. The experiments show that HMM requires a sufficiently large number of samples to train its parameters, and that SVM performs better than HMM when samples are few.

V. CONCLUSION
In this paper, SVM is introduced as the classifier for lipreading. To meet the fixed feature dimensionality requirement, a DTAK-based method was discussed and compared with several traditional feature dimensionality normalization methods; the DTAK method achieved the best accuracy. To test the generalization capability of SVM, HMM was used for comparison. The experiments show that SVM performs better than HMM with small samples.