Allowing good impostors to test

Biometric testing should attempt to report unbiased, real-world system performance, especially when tested on limited databases. Though testing on a standard database, such as the Linguistic Data Consortium's YOHO, allows comparison of speaker verification systems, it is well known that certain procedures bias the results low. One such procedure concerns the use of cohort or reference speakers to perform verification, where the cohort speakers are removed as candidate impostors. A method of testing is proposed to remove this bias by modifying the cohort set for each false acceptance test. Results statistically differ for this modified approach, which tries to "best" model the general population with a fixed random sample. Lastly, three techniques to bound the biometric performance, using both parametric methods and non-parametric resampling, are demonstrated.


Introduction
Biometric systems often report their performance based on many different testing methodologies. In ICASSP-95, Campbell [1] recommended test procedures for the LDC YOHO Speaker Verification database. These procedures included intra-gender testing and the use of 5 cohort or reference speakers for score normalization. Campbell described the bias dilemma, hypothesizing that reported results will be biased low (decreased False Acceptance rates) when either excluding or allowing cohort speakers. Cohort speakers normalize the test score to remove various word effects of the utterance and have been shown to greatly improve verification accuracy [2]. This article first examines an alternative testing methodology to remove this bias from the reported Equal Error Rate (EER) or Zero Rejection Rate (ZRR) performance of a hidden Markov model speaker verification system. Next, we attempt to provide methods for biometric performance characterization. Knowing that a single result is merely a random variable, we bound the EER with confidence intervals. This paper is organized as follows: first, the speaker templates and the cohort selection and normalization procedure are briefly described. Next, we demonstrate statistically the bias encountered in standard test plans. Finally, we demonstrate techniques to bound Equal Error Rate performance using resampling and compare them to parametric methods.

HMM Speaker Verification System
The Linguistic Data Consortium's (LDC) YOHO Speaker Verification database is the only large-scale, scientifically controlled and collected, publicly available database for speaker verification which allows testing at high confidence levels. The 106 male subjects are examined in this paper, each having provided 96 enrollment utterances and 40 test utterances over a 3-month interval. The speech material consists of "combination-lock" phrases. An example prompt is: "57-26-64", pronounced "fifty-seven, twenty-six, sixty-four". The vocabulary contains sixteen words, producing 56 possible doublets and a list of 166,320 phrases.
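The stated phrase count can be checked with a short calculation. The sketch below assumes that each phrase is an ordered selection of three distinct doublets; this is an interpretation rather than something stated in the text, but under that assumption the count agrees with the figure quoted above.

```python
import math

NUM_DOUBLETS = 56        # possible doublets, from the text
DOUBLETS_PER_PHRASE = 3  # e.g. "57-26-64"

# Assumption: a phrase is an ordered selection of three distinct doublets.
num_phrases = math.perm(NUM_DOUBLETS, DOUBLETS_PER_PHRASE)
print(num_phrases)  # 166320
```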
In the front-end analysis, the raw YOHO audio files are preemphasized and framed every 20 msec, then 12th-order Mel-warped frequency cepstral coefficients are extracted. We append normalized energy, then augment this 13-dimensional vector with both first- and second-order transitional coefficients, forming a 39-dimensional final representation.
Each YOHO male speaker is modeled by a set of 21 hidden Markov models [3] representing context-independent phonemes (Table 1). Each hidden Markov model consists of 3 states, each represented by a mixture of 3 Gaussian densities. A single diagonal covariance is shared amongst all models. Each speaker template is thus modeled by 7788 parameters. An utterance, $U$, is scored against the set of models for speaker $i$, $\lambda_i$, using forced alignment (Viterbi decoding) based on the known transcription. This procedure provides a normalized log likelihood of the utterance given the models, $\log p(U \mid \lambda_i)$. This method gave the best performance over other less-constrained Viterbi decoding methods [3]. Training iterations use Baum-Welch reestimation, with initial models bootstrapped from previously trained, similar TIMIT phoneme HMMs.
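The figure of 7788 parameters per speaker template can be reproduced with simple arithmetic. The breakdown below is an assumption, not something the text spells out: one mean vector per Gaussian, one mixture weight per Gaussian, a full 3 x 3 transition matrix per model, and one shared diagonal covariance. With those choices the total matches.

```python
NUM_MODELS = 21    # context-independent phoneme HMMs per speaker
NUM_STATES = 3     # states per HMM
NUM_MIXTURES = 3   # Gaussian densities per state
FEATURE_DIM = 39   # final feature dimension

num_gaussians = NUM_MODELS * NUM_STATES * NUM_MIXTURES    # 189
mean_params = num_gaussians * FEATURE_DIM                 # 7371
weight_params = num_gaussians                             # 189
# Assumption: a full NUM_STATES x NUM_STATES transition matrix per model.
transition_params = NUM_MODELS * NUM_STATES * NUM_STATES  # 189
# One diagonal covariance shared amongst all models.
covariance_params = FEATURE_DIM                           # 39

total_params = (mean_params + weight_params
                + transition_params + covariance_params)
print(total_params)  # 7788
```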

Cohort Normalization
For a particular speaker, enrollment data is used to choose the closest set of speakers, known as cohorts. Reynolds [9] provides a symmetric distortion measure between two models using enrollment utterances from both the target and the potential cohort to determine similarity. If $U_i$ and $\lambda_i$ represent the training observations and model of speaker $i$, respectively, then a symmetric distortion measure can be defined between the models $\lambda_i$ and $\lambda_j$. Stepping through each target speaker $i$, we examine the closeness of each potential cohort speaker $j$. The end result of this process is a sorted list of cohort speakers, unique for each target speaker, which will provide a beneficial normalization to the claimed speaker's score. Other methods of defining cohort speakers include the geometric mean, the maximum pick within the cohort pool, and the second-order Bhattacharyya distance.
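The cohort ranking step can be sketched as follows. The exact distortion expression is not reproduced in this text, so the sketch assumes one commonly used symmetric form built from cross-scored enrollment data, d(i, j) = [log p(U_i | λ_i) - log p(U_i | λ_j)] + [log p(U_j | λ_j) - log p(U_j | λ_i)]. The function name `rank_cohorts` and the `loglik` table layout are hypothetical.

```python
def rank_cohorts(loglik):
    """Return, for each target i, the other speakers sorted closest first.

    loglik[i][j] is assumed to hold the normalized log likelihood of
    speaker i's enrollment utterances scored against speaker j's model.
    """
    n = len(loglik)
    rankings = []
    for i in range(n):
        def distortion(j):
            # Assumed symmetric distortion between models i and j.
            return ((loglik[i][i] - loglik[i][j])
                    + (loglik[j][j] - loglik[j][i]))
        others = [j for j in range(n) if j != i]
        rankings.append(sorted(others, key=distortion))
    return rankings


# Toy example with three speakers (values are made up for illustration).
toy = [[-1.0, -2.0, -5.0],
       [-2.5, -1.0, -3.0],
       [-6.0, -3.0, -1.0]]
print(rank_cohorts(toy))  # [[1, 2], [0, 2], [1, 0]]
```

A smaller distortion means the two models explain each other's enrollment data almost as well as their own, so that speaker is a closer cohort.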
The likelihood ratio test is a useful tool based on Bayesian analysis for performing speaker verification. The Bayes error rate, a statistical upper bound on the performance of any pattern classifier, is achieved by applying the Bayes decision rule. The log-likelihood ratio, using probability density functions, either known or approximated, is as follows:

$$\mathcal{L}(U) = \log p(U \mid \lambda_c) - \log p(U \mid \lambda_{\bar{c}}) \qquad (1)$$

Speaker verification systems are then based on this log-likelihood ratio $\mathcal{L}(U)$ of the utterance (or set of utterances), obtained by applying the concept to the claimed model $\lambda_c$ against the model for anyone other than the claimant, that is, the impostor model $\lambda_{\bar{c}}$.
If the above quantity is greater than the threshold $T$, which accounts for the unknown a priori probabilities, the maximum likelihood decision is to accept the utterance $U$ as the claimed speaker. We seek to approximate the last term of Equation 1 using a set of "close" reference speakers, as suggested by Higgins [7]. Campbell establishes methods for testing on YOHO, calling these reference speakers "cohorts." Researchers such as Rosenberg [10] and Furui [11] present several measures for cohort normalization, each an approximation to the last term of the log-likelihood ratio (Equation 1). Specifically, define a set of cohort speakers $S$ of size $|S|$. Then, using the geometric mean over the set of cohort speakers, for example, the log-likelihood ratio is given by

$$\mathcal{L}(U) = \log p(U \mid \lambda_c) - \frac{1}{|S|} \sum_{s \in S} \log p(U \mid \lambda_s) \qquad (2)$$

Thus, this cohort normalization simply becomes the average of the log likelihood scores over the cohort set.
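As a minimal sketch of the geometric-mean cohort normalization described above, the impostor term is replaced by the average of the cohort log likelihoods. The function names and the example scores are hypothetical, chosen only for illustration.

```python
def cohort_normalized_score(claimed_loglik, cohort_logliks):
    """Claimed-speaker log likelihood minus the mean cohort log likelihood."""
    if not cohort_logliks:
        raise ValueError("cohort set must not be empty")
    return claimed_loglik - sum(cohort_logliks) / len(cohort_logliks)


def accept(claimed_loglik, cohort_logliks, threshold):
    """Accept the claim when the normalized score exceeds the threshold T."""
    return cohort_normalized_score(claimed_loglik, cohort_logliks) > threshold


# Made-up scores for one utterance and a cohort set of size 5.
score = cohort_normalized_score(-10.0, [-12.0, -14.0, -13.0, -11.0, -15.0])
print(score)  # 3.0
```

Averaging log likelihoods is the same as taking the logarithm of the geometric mean of the cohort likelihoods, which is why the normalization reduces to a simple mean.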
The overall speaker verification system reported here operates as follows [3], as shown in Figure 3. Each utterance is scored, using forced alignment, by the likelihood that the utterance was generated by the claimed speaker's model. To remove word effects, this score is normalized by 5 or 10 reference speakers for the claimed speaker. Lastly, this normalized score is compared to a global, speaker-independent threshold. Equal error rates are often reported; these are found by varying this threshold until the false acceptance error rate equals the false rejection error rate. The zero rejection rate is reported at the threshold for which the false rejection rate equals zero.
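The two operating points can be illustrated with a small threshold sweep. This is a sketch under stated assumptions: ties are counted as acceptances, the EER is approximated at the observed score where the two error rates are closest, and the zero rejection rate is taken to be the false acceptance rate at the highest threshold that rejects no genuine trial. The function names and toy scores are hypothetical.

```python
def error_rates(genuine, impostor, threshold):
    """Return (false_accept_rate, false_reject_rate) at one global threshold."""
    fa = sum(s >= threshold for s in impostor) / len(impostor)
    fr = sum(s < threshold for s in genuine) / len(genuine)
    return fa, fr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping the threshold over observed scores."""
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        fa, fr = error_rates(genuine, impostor, t)
        gap = abs(fa - fr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fa + fr) / 2
    return best_eer


def zero_rejection_rate(genuine, impostor):
    """False acceptance rate when no genuine trial is rejected."""
    fa, _ = error_rates(genuine, impostor, min(genuine))
    return fa


genuine_scores = [2.0, 3.0, 4.0, 5.0]    # true-speaker trials (made up)
impostor_scores = [0.0, 1.0, 2.5, 3.5]   # impostor trials (made up)
print(equal_error_rate(genuine_scores, impostor_scores))     # 0.25
print(zero_rejection_rate(genuine_scores, impostor_scores))  # 0.5
```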

Cohort Replacement Results
Typically, results are shown for tests excluding cohorts as impostors. We examine the statistical difference when cohorts are allowed to test, as well as when we apply the Modified cohort set methodology. This new method includes a cohort speaker as an impostor, but removes their score from the log-ratio normalization while adding the next closest cohort. This ensures that the cohort set size remains constant. This approach attempts to model good impostors in the general population who are not known or available for inclusion in the cohort or reference pool. For each of the 106 targets, the Modified method iterates through the testing of all 105 possible impostors. Equal error rates are shown in Table 2, with zero rejection rates provided in Table 3. The Modified method produced results that differed significantly from both other methods.
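The three impostor-testing variants differ only in how the cohort set is chosen for a given false acceptance trial. The sketch below illustrates that choice; the function name `cohort_set`, the method labels, and the speaker IDs are hypothetical, and the ranked list is assumed to be sorted from closest to farthest from the target.

```python
def cohort_set(ranked_cohorts, impostor, size, method):
    """Pick the cohort set used to normalize one false acceptance trial.

    method is "removed", "allowed", or "modified".
    Returns None when the trial is skipped.
    """
    top = ranked_cohorts[:size]
    if impostor not in top:
        return top  # the impostor is not a cohort, so all methods agree
    if method == "removed":
        return None  # cohort speakers are excluded as impostors
    if method == "allowed":
        return top   # the impostor's own model stays in the normalization
    if method == "modified":
        # Drop the impostor's model and pull in the next closest speaker,
        # so the cohort set size stays constant.
        return [s for s in ranked_cohorts if s != impostor][:size]
    raise ValueError(f"unknown method: {method}")


ranked = [7, 3, 9, 4, 1, 8, 2]  # made-up speaker IDs, closest first
print(cohort_set(ranked, 9, 5, "modified"))  # [7, 3, 4, 1, 8]
```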
In Figure 2, we examine the cumulative percentage of the False Acceptance errors attributed to each of the 20 closest cohort speakers. This procedure demonstrates to what degree the cohort speakers contribute to False Acceptance errors. For example, using the typical cohort set size of 5 and testing with 4 combination-lock phrases, the first 5 cohorts account for only 2% of the total False Accept errors when tested using the Allowed method. This test confirms Campbell's hypothesis that a speaker would be rejected when their model contributes to the reference score. When allowed to test with the Modified approach, these first 5 cohorts account for 32% of the total False Accept errors.

Bounding the Error Rate
In addition to using a fixed database wisely, it is often useful to provide bounds on the performance. The independence of biometric samples is often questioned and scrutinized. This section examines reporting the equal error rate (EER) with confidence intervals, using both parametric and non-parametric techniques.

Parametric Strategies
It should first be noted that a data set provides not only an estimate of the EER but also the accuracy of the EER, namely the standard error of estimation [6]. This error of estimation can be applied to statistical inference, either to infer the acceptability of a biometric system or to compare the statistical difference between two systems, an important tool when many biometric systems are beginning to hit the market. Parametric approaches to confidence intervals attempt to uncover the standard error of an estimate. For example, letting the number of false accept (impostor) errors be $k$ under $N$ independent tests, the point estimate of the error rate $p$ is $\hat{p} = k/N$. For large $N$, we use the DeMoivre large-sample approximation to the binomial, resulting in a $(1-\alpha)$ confidence interval expressed by

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \qquad (3)$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ critical value of the standard normal distribution. Under certain conditions, a Poisson may be used to approximate the binomial. It is noted that $k$ is approximately normal as $N \to \infty$, with mean $Np$ and variance $Npq$, where $q = 1-p$. Hoel [8] provides some experimental insight, in that this approximation is only valid when $Np > 5$ for $p \le 0.5$, or $Nq > 5$ for $p > 0.5$.
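A minimal sketch of the large-sample interval follows, assuming the usual normal approximation to the binomial with the standard error estimated from the point estimate. The function names and the example counts are hypothetical; the validity check encodes the rule of thumb attributed to Hoel above.

```python
import math
from statistics import NormalDist


def binomial_confidence_interval(k, n, alpha=0.05):
    """Normal-approximation (1 - alpha) interval for an error rate k / n."""
    p_hat = k / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def normal_approx_ok(k, n):
    """Rule of thumb: Np > 5 when p <= 0.5, otherwise Nq > 5."""
    p_hat = k / n
    return n * p_hat > 5 if p_hat <= 0.5 else n * (1 - p_hat) > 5


# Made-up example: 50 false accepts in 1000 independent impostor trials.
low, high = binomial_confidence_interval(50, 1000)
print(round(low, 4), round(high, 4))  # 0.0365 0.0635
```

The interval is only as trustworthy as the independence assumption behind the $N$ trials, which is the concern that motivates the resampling methods.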

Resampling Strategies
Resampling, or bootstrapping, is a method of computational statistics that often avoids many parametric assumptions concerning the distribution of the data [6,5,4]. For the verification problem reporting EER, a two-distribution method of resampling must be used [6,4]. This method estimates the underlying distribution $G$ from the empirical probability distribution of the actual data, using a Monte Carlo method. The bootstrap sample is created by making $N$ independent draws from the data, with replacement. Then, the bootstrap statistic is evaluated on the bootstrap sample, $e^* = e(g_1^*, g_2^*, \ldots, g_N^*)$. After a large number of bootstrap estimates, the standard error can be evaluated, or non-parametric confidence intervals can be read from the bootstrap histogram. The $1-2\alpha$ confidence interval simply uses the $100\alpha$ and $100(1-\alpha)$ percentiles of the bootstrap histogram.
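The percentile bootstrap for the EER can be sketched as below, resampling the true-speaker and impostor score sets separately, with replacement. This is an illustration under assumptions, not the procedure used to produce the reported results: the EER is approximated on observed thresholds, ties count as acceptances, and the function names, seed and defaults are hypothetical.

```python
import random


def eer(genuine, impostor):
    """Approximate EER at the observed threshold where FA and FR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        fa = sum(s >= t for s in impostor) / len(impostor)
        fr = sum(s < t for s in genuine) / len(genuine)
        if best is None or abs(fa - fr) < best[0]:
            best = (abs(fa - fr), (fa + fr) / 2)
    return best[1]


def bootstrap_eer_interval(genuine, impostor, num_boot=1000,
                           alpha=0.025, seed=0):
    """Percentile (1 - 2*alpha) interval from two-distribution resampling."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_boot):
        # Draw each score set independently, with replacement.
        g_star = rng.choices(genuine, k=len(genuine))
        h_star = rng.choices(impostor, k=len(impostor))
        estimates.append(eer(g_star, h_star))
    estimates.sort()
    lo = estimates[int(alpha * num_boot)]
    hi = estimates[min(num_boot - 1, int((1 - alpha) * num_boot))]
    return lo, hi


# Perfectly separated toy scores: every bootstrap EER is zero.
print(bootstrap_eer_interval([5.0, 6.0, 7.0], [0.0, 1.0, 2.0], num_boot=50))
# (0.0, 0.0)
```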
Specifically, for the EER statistic, sampling from two distributions, $G$ and $H$, is required. As suggested by Campbell in ICASSP-95 [1], one way of removing bias from the reported EER statistic (biased low) is to resample the speaker pool. Instead of fixing the scores and resampling those scores, as recently suggested by Diegert [4], one fixes the speaker population and resamples a target pool, a cohort pool and an impostor pool. Refer to Figure 5.2. For example, each bootstrap sample could consist of selecting target speakers, cohort speakers and impostor speakers. Subsequently, for each of the targets, a sorted list of cohorts is calculated, based on enrollment data. Using test data, each of the targets attempts to gain access as themselves, resulting in a set of FR scores. Then, the set of impostors attempts to gain access as each of the targets, resulting in a set of FA scores. The EER for this bootstrap sample is found by varying a global threshold. This procedure is repeated for $B$ bootstrap samples.
Figure 5 compares the three methods of bounding the Equal Error Rate: resampling the log-likelihood ratio scores, resampling speakers, and the parametric confidence interval using Equation 3. Confidence interval estimates are tightest when resampling the log-ratio scores with replacement, and appear widest when resampling speakers.

Discussion
This research first examined three test procedures for biometric systems. The goal is to unbias the impostor testing to ensure that reliable and realistic statistics are reported. We report our equal error rates with two different cohort set sizes, two variations on combinations/test, and three impostor testing methods. We analyze the confidence interval for the difference between the Modified method and either the Removed or Allowed methods. In all cases of Table 3, we conclude there is a significant difference, using a 95% confidence interval, between the Modified ZRR and both the Removed and Allowed ZRRs. We have proposed a method whose results are less biased than those of standard reporting procedures. Next, we reviewed three methods of bounding the true performance of a speaker verification system. By far the simplest method involves assuming the true EER is within a symmetric confidence interval about the point estimate of the EER. The other two techniques use computational statistics to repeatedly resample either the log-ratio scores or the speakers to provide a non-parametric estimate of the EER histogram. All three techniques provide similar results, reported here for 1, 2 and 4 combinations/test.