Comparison of I-vector and GMM-UBM approaches to speaker identification with TIMIT and NIST 2008 databases in challenging environments

In this paper, two models, the I-vector and the Gaussian Mixture Model-Universal Background Model (GMM-UBM), are compared for the speaker identification task. Four feature combinations of I-vectors with seven fusion techniques are considered: maximum, mean, weighted sum, cumulative, interleaving and concatenated for both two and four features. In addition, an Extreme Learning Machine (ELM) is exploited to identify speakers, and then Speaker Identification Accuracy (SIA) is calculated. Both systems are evaluated for 120 speakers from the TIMIT and NIST 2008 databases for clean speech. Furthermore, a comprehensive evaluation is made under Additive White Gaussian Noise (AWGN) conditions and with three types of Non Stationary Noise (NSN), both with and without handset effects for the TIMIT database. The results show that the I-vector approach is better than the GMM-UBM for both clean and AWGN conditions without a handset. However, the GMM-UBM had better accuracy for NSN types.


I. INTRODUCTION
The I-vector approach has received increasing interest in different research fields such as verification, language recognition and emotion recognition. In [1], it was used for robust language identification and verification, while [2] and [3] studied emotion and speech recognition; in [2], the I-vector results were compared with those for the GMM-UBM model. Different speaker identification challenges have been studied, for instance, increasing numbers of speakers, channel variabilities, and the effects of noise and a handset. In [4], a GMM method was investigated for text independent speaker identification under noisy telephone channels, while robust speaker identification in noisy environments has been studied elsewhere, such as in [5], [6], [7], [8] and [9]. Moreover, [10] focused on the size of the population and the degradation produced by a noisy telephone channel and system, using the TIMIT and NTIMIT databases. However, few studies have involved a handset, AWGN, and NSN types in conjunction with fusion strategies. Handset variability effects for speaker recognition were studied in [11]. Session compensation with the I-vector approach was considered in [12] using Linear Discriminant Analysis (LDA), Nuisance Attribute Projection (NAP) and Within Class Covariance Normalization (WCCN) for text independent speaker identification. Nevertheless, that study lacked a large number of speakers, as only 50 self collected speakers were used. In [13], 1,000 speakers were selected from YouTube to construct an I-vector speaker identification framework, but this non-standard database did not include noisy conditions.
In this paper, we establish two robust text independent closed set speaker identification systems: 1) a new fusion-based I-vector framework that applies two feature compensation methods, Cepstral Mean Variance Normalization (CMVN) and Feature Warping (FW), to the Mel Frequency Cepstral Coefficient (MFCC) and Power Normalized Cepstral Coefficient (PNCC) features. Four combinations of I-vectors are thereby produced: FWMFCC, CMVNMFCC, FWPNCC and CMVNPNCC. Seven fusion types are then exploited to yield multidimensional I-vectors, which are passed to a simple, fast, and efficient ELM classifier to identify the speakers; and 2) an extended evaluation of the score fusion based GMM-UBM that includes different NSN types with/without a handset. This builds on our previous study [14], which evaluated clean and AWGN conditions. In addition, late fusion techniques were employed with the GMM-UBM approach using maximum, mean and weighted sum to improve performance accuracy in clean speech, and to mitigate handset and background noise effects. In the current work, we make fair comparisons between the I-vector approach and the modified fusion-based traditional GMM-UBM approach, using 120 speakers from each of the NIST 2008 and TIMIT databases: in total, 240 speakers with 2,400 speech utterances are employed. This work provides a clean speech evaluation on the NIST 2008 database for the stated approaches over a wide range of Gaussian Mixture Components (GMC). An evaluation is also presented for the TIMIT database under clean, AWGN, street traffic NSN, bus interior NSN, and crowd talking NSN conditions with/without a G.712 type handset at 16 kHz. This paper is structured as follows: Section II provides the I-vector and GMM-UBM frameworks; Section III gives the experiments and results; Section IV presents the related work; Section V includes the conclusions and the future work.
II. SPEAKER IDENTIFICATION SCHEMES USING GMM-UBM AND I-VECTOR
Fig. 1 shows two speaker identification systems: the GMM-UBM approach from a previous study [14] and the proposed I-vector approach. Both systems were trained as in Part A in Fig. 1, and tested as in Part B. The full procedure for the GMM-UBM evaluation can be found in [14], whereas Table I shows the procedure for the I-vector.

A. Fusion Techniques for Combining I-vector Features
According to [15], the mathematical model for I-vector implementation is explained using:
S = µ + Ti
where: u is the given speech utterance, c = (1,..., C) indexes the UBM mixture components, C is the number of mixture components, which in this work is also denoted as Mix = {8, 16, 32, 64, 128, 256, 512}, F is the dimensionality of the acoustic feature vectors, i is the identity vector (I-vector), T is the total variability matrix, µ is the speaker and channel independent supervector, and S is the speaker and channel dependent supervector. In addition, I is the identity matrix, and k = 1, 2, 3, 4, while the weights ω1, ω2, ω3 and ω4 = 0.7, 0.77, 0.8 and 0.9 respectively, which were found empirically to yield a higher identification rate.
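As a rough illustration of the fusion types named in this work, the following Python sketch fuses I-vectors with NumPy. It is not the authors' implementation: the function names and the random stand-in vectors are hypothetical, and the way the weights enter the weighted sum (a plain weighted combination of the I-vectors) is an assumption. The cumulative fusion is left out because its definition is not given here.

```python
import numpy as np


def fuse_max(ivectors):
    """Element-wise maximum over k stacked I-vectors of shape (k, d) -> (d,)."""
    return np.max(np.asarray(ivectors, dtype=float), axis=0)


def fuse_mean(ivectors):
    """Element-wise mean over k stacked I-vectors of shape (k, d) -> (d,)."""
    return np.mean(np.asarray(ivectors, dtype=float), axis=0)


def fuse_weighted_sum(ivectors, weights):
    """Weighted combination sum_k w_k * i_k (assumed form) -> (d,)."""
    ivectors = np.asarray(ivectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ ivectors


def fuse_concat(ivectors):
    """Concatenate k I-vectors end to end: (k, d) -> (k * d,)."""
    return np.asarray(ivectors, dtype=float).reshape(-1)


def fuse_interleave(ivectors):
    """Interleave elements: [a0, b0, a1, b1, ...] for two vectors a and b."""
    return np.asarray(ivectors, dtype=float).T.reshape(-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 100
    # Random stand-ins for the FWMFCC, CMVNMFCC, FWPNCC and CMVNPNCC I-vectors.
    ivecs = rng.standard_normal((4, d))
    # Weights quoted in the text; assigning one weight per feature is an assumption.
    weights = [0.7, 0.77, 0.8, 0.9]

    print(fuse_max(ivecs).shape)                    # (100,)
    print(fuse_mean(ivecs).shape)                   # (100,)
    print(fuse_weighted_sum(ivecs, weights).shape)  # (100,)
    print(fuse_interleave(ivecs[:2]).shape)         # (200,)
    print(fuse_concat(ivecs[:2]).shape)             # (200,)
    print(fuse_concat(ivecs).shape)                 # (400,)
```

The maximum, mean and weighted sum keep the dimension d, whereas interleaving or concatenating two I-vectors gives 2d and concatenating four gives 4d, which matches the 100, 200 and 400 dimensions used in the experiments.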

B. Identifying Speakers using Extreme Learning Machine
Recently, ELMs have been widely used in fields such as computer vision, biomedical engineering, and control and robotics, because they are simple, efficient and have impressive performance [18], [19], [20] and [21]. An ELM has a single hidden layer whose node parameters are randomly generated. The number of input nodes is equal to the Number of I-vector Dimensions (NID), and we used an almost equal Number of Hidden Neurons (NHN); however, we used different numbers when this was necessary to achieve higher performance accuracy. In addition, the number of output neurons is equal to the number of classes, and in our work, 120 classes were used to represent 120 speakers. The ELM algorithm can be summarized as follows: first, the input weights and biases are randomly generated; then, the hidden layer output matrix is computed; finally, the output weights are calculated.
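The three ELM steps summarized above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the classifier used in the experiments: the class and function names are hypothetical, the sigmoid activation is an assumption, and the output weights are solved with the Moore-Penrose pseudoinverse, which is the usual ELM formulation but is not spelled out in the text. SIA is taken here to be the percentage of test utterances assigned to the correct speaker.

```python
import numpy as np


class ELMClassifier:
    """Single hidden layer ELM with random, untrained hidden parameters."""

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden          # NHN in the paper's notation
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Hidden layer output matrix H (sigmoid activation is an assumption).
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # Step 1: randomly generate input weights and biases.
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        # Step 2: compute the hidden layer output matrix.
        H = self._hidden(X)
        # One-hot targets, one column per speaker class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Step 3: output weights as the least squares solution of H @ beta = T.
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        scores = self._hidden(np.asarray(X, dtype=float)) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]


def identification_accuracy(y_true, y_pred):
    """Percentage of utterances assigned to the correct speaker."""
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))


if __name__ == "__main__":
    # Toy data only: 3 synthetic "speakers" with 10-dimensional stand-in vectors.
    rng = np.random.default_rng(1)
    centers = 5.0 * np.eye(3, 10)
    X = np.vstack([c + 0.1 * rng.standard_normal((20, 10)) for c in centers])
    y = np.repeat(np.arange(3), 20)
    elm = ELMClassifier(n_hidden=50).fit(X, y)
    print(identification_accuracy(y, elm.predict(X)))
```

Because only the output weights are solved for, and in closed form, training involves no iterative optimization, which is consistent with the description of the ELM as simple and fast.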

III. EXPERIMENTAL RESULTS AND DISCUSSION
Two databases were exploited in this study: the TIMIT database and the 2008 NIST Speaker Recognition Evaluation Training Set Part 2. We exploited 120 speakers from TIMIT, with a total of 1,200 speech utterances, of which 480 were used for testing and 720 for training, as in [14]. For the NIST 2008 database, we exploited 120 English speakers recorded over a microphone channel, and the sampling frequency was converted from 8 to 16 kHz to mirror the TIMIT database. Only single speakers were selected by deleting the interviewers. In addition, each speech file was divided into ten equal lengths, with a fixed length of eight seconds, and six out of ten were used for training (the rest for testing). Our experiments can be divided into two main parts based on the NIST 2008 and TIMIT databases, in terms of the evaluations for the I-vector and GMM-UBM approaches.

A. Part A: Simulations for NIST 2008
In this part, clean speech evaluations for the I-vector and GMM-UBM approaches were developed, as illustrated in Table II and Table III. The evaluations show the relationship between the SIA and the GMC, including the Number of Hidden Neurons (NHN), which is equal to the Number of I-vector Dimensions (NID). According to both tables, the highest SIA for the I-vector approach exceeds that of the GMM-UBM approach at a mixture size of 256, with 96.67% compared with 95.83%. However, the I-vector approach had lower results for small UBM mixture sizes.

B. Part B: The Simulations for TIMIT
The comparison of the I-vector and GMM-UBM techniques on the TIMIT database covered various background noise types with/without a handset: clean speech, AWGN Without Handset (WOH), AWGN With Handset (WH), street traffic NSN WOH and WH, bus interior NSN WOH and WH, and finally, crowd talk NSN WOH and WH. A G.712 type handset at 16 kHz was used, and each simulation employed eleven I-vectors based on feature and fusion methods. Four were feature based, FWMFCC, CMVNMFCC, FWPNCC and CMVNPNCC, with an I-vector dimension of 100. The other seven were fusion based: Weighted sum, Maximum, Mean and Cumulative I-vector fusion with d-dimension (100), Concatenated and Interleaving fusion with 2d-dimension (200), and Concatenated fusion with 4d-dimension (400). Fig. 2 compares the GMM-UBM and I-vector approaches in clean speech for TIMIT, where the best SIA for each mixture size was selected from both approaches regardless of feature or fusion method. Based on Fig. 2, the mixture size 256 was used for the evaluation under all noise conditions. Fig. 3 and Fig. 4 show the comparisons for both GMM-UBM and I-vector systems in AWGN, street NSN, bus NSN and crowd talking NSN with/without a handset over an SNR range of 0-30 dB. The continuous coloured curves with square nodes at the SNR levels represent the I-vector approach, while the dash-dot coloured curves with circle nodes depict the GMM-UBM approach. The same colour is used for the same noise type in both systems. The worst performance was under AWGN because it has a constant noise spectrum, while bus NSN produced a smaller reduction in SIA than all the other non-stationary noise types. The SIA for street and crowd talking NSN fell between that for AWGN and that for bus NSN. The relationship between the SIA for the GMM-UBM and I-vector approaches under the different noise conditions with/without the handset is shown in Fig. 3 and Fig. 4.
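To make the SNR axis concrete, the Python sketch below shows one common way to corrupt a clean signal at a target SNR, either with synthetic white Gaussian noise or with a recorded noise segment such as street, bus or crowd noise. How the noise was actually added in these experiments is not described here, so the power-based scaling, the function names and the synthetic test tone are all assumptions, and the G.712 handset filtering is not modelled.

```python
import numpy as np


def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the expected SNR equals snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def mix_at_snr(signal, noise, snr_db):
    """Scale a recorded noise segment and add it at the requested SNR."""
    signal = np.asarray(signal, dtype=float)
    # Repeat or truncate the noise so that it matches the signal length.
    noise = np.resize(np.asarray(noise, dtype=float), signal.shape)
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise


def measured_snr_db(signal, noise):
    """SNR in dB computed from the average powers of signal and noise."""
    return 10.0 * np.log10(np.mean(np.square(signal)) / np.mean(np.square(noise)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000                                   # 16 kHz, as in the experiments
    t = np.arange(10 * fs) / fs
    clean = np.sin(2.0 * np.pi * 440.0 * t)      # synthetic stand-in for speech
    for snr in range(0, 35, 5):                  # 0 to 30 dB in 5 dB steps (assumed step)
        noisy = add_awgn(clean, snr, rng)
        print(snr, round(measured_snr_db(clean, noisy - clean), 2))
```

With a recorded noise file, `mix_at_snr` gives the requested SNR exactly over the whole utterance, whereas `add_awgn` matches it only in expectation because the noise is drawn at random.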

IV. RELATED WORK
This section summarises the current work on I-vector and GMM-UBM approaches and other related work, alongside our previous work and other state of the art methods [14], [22], [12], [13], [23], [24], and [5]. In Table IV, the handset used was G.712 type at 16 kHz, and all proposed noise measurements were at SNR 30 dB and mixture size 256. The best SIA results were for clean speech, and our evaluations included various SNR levels, as explained in Fig. 3 and Fig. 4. The I-vector approach achieved better SIA than the GMM-UBM under clean speech for both TIMIT and NIST 2008 databases. It also outperformed all clean speech measurements reported by other researchers. For TIMIT, the proposed I-vector approach achieved higher SIA under AWGN WOH than the previous study on the GMM-UBM system and than other work; in contrast, our previous work with GMM-UBM had better SIA than the proposed I-vector for AWGN WH, in line with other work. In addition, for non stationary background noise WH, the performance accuracy of GMM-UBM was better than the I-vector at SNR 30 dB, but this reversed for some SNR levels. Finally, in [5], it seems the SIA for street noise was higher than in the proposed work, but this was achieved using a different noise database with 630 speakers.

V. CONCLUSIONS
This paper considered robust text independent speaker identification using the I-vector approach under various background noises and handset effects. The proposed work was compared fairly with the GMM-UBM approach and evaluated on the TIMIT and NIST 2008 databases for clean speech, and on the TIMIT database under nine different conditions, using eleven I-vectors based on feature and fusion-based methods. The proposed system outperformed the GMM-UBM for clean speech on both databases and on the TIMIT database under AWGN WOH, and it seems better at some SNR levels with street and crowd talking NSN. In contrast, for bus interior NSN, the GMM-UBM achieved less reduction in SIA compared with the I-vector approach. Additionally, fusion techniques may mitigate the reduction caused by different noise environments and the handset effect, and weighted fusion generally seems to be the best of all the feature and fusion methods used. In future work, we will consider new databases such as the Speakers in the Wild (SITW) Speaker Recognition Challenge database. We will also extend our evaluation of the NIST 2008 database to include stationary noise and various NSN types with a handset.