An automatic text-independent speaker recognition system

A speaker recognition system has been developed which recognizes different voices in a context-free speaking environment. It is composed of an ART-II neural network coupled to a pair of fuzzy expert systems which classify voices hierarchically from signal features and clusters. It has been laboratory tested using digitized voices, with performance of over 60% correct for about 4 s of voiced speech.


INTRODUCTION
An in-house investigation performed by the EW/RSTA Systems Engineering Division resulted in the development of a Speaker Recognition System (SRS) which recognizes different speakers in a context-free speaking environment. This paper presents the results of the first year of this R&D effort.
Speaker recognition of text-independent information has had limited success to date using short time samples.
Markel and Davis [7] obtained text-independent speaker recognition results of 98%; however, this required an average of 39 seconds of speech.
Additionally, these experiments were performed using large input bandwidths and low-noise backgrounds.
The described system has more stringent requirements: recognition must be made with an input bandwidth of 4.0 kHz and short bursts of speech.
The Speaker Recognition System classifies and sorts like speakers according to the characteristics of their voices. An introduction to the problems can be found in [1]. The voice characterization is performed in two stages: Signal Processing and Glob Processing.

B.1 NEURAL-NET PROCESSING. An ART-II model clusters globs through a set of selected transform features into preliminary speaker classes.
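For illustration, the ART-style clustering step above can be sketched as a vigilance-gated prototype matcher: each glob's feature vector either resonates with an existing prototype or founds a new preliminary speaker class. The feature vectors, similarity measure (cosine), and learning rate below are assumptions for the sketch, not the system's actual parameters; only the 0.85 vigilance value is taken from the tests reported later.

```python
import numpy as np

def art_cluster(feature_vectors, vigilance=0.85, lr=0.2):
    """Simplified ART-II-style clustering sketch: each normalized
    feature vector (one per glob) either resonates with an existing
    prototype (cosine similarity >= vigilance) or founds a new
    preliminary speaker class."""
    prototypes, labels = [], []
    for v in feature_vectors:
        v = v / np.linalg.norm(v)                 # ART-II operates on normalized input
        if prototypes:
            sims = [float(p @ v) for p in prototypes]
            best = int(np.argmax(sims))
            if sims[best] >= vigilance:           # resonance: encode into this class
                p = (1 - lr) * prototypes[best] + lr * v   # slow-learning update
                prototypes[best] = p / np.linalg.norm(p)
                labels.append(best)
                continue
        prototypes.append(v)                      # mismatch reset: create a new class
        labels.append(len(prototypes) - 1)
    return labels, prototypes
```

Raising the vigilance parameter makes resonance harder, so more (smaller) preliminary classes are created; this is the trade-off the later vigilance tests probe.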

B.2 EXPERT SYSTEM PROCESSING.
The preliminary speaker classes are analyzed, rated, and merged as nodes in a fuzzy relational network.
They are processed using a pair of fuzzy expert systems which produce the final voice classification. This stage checks and merges preliminary speaker clusters to create a final speaker list.
It also returns an Encode/NoEncode command to the neural network for inclusion into a given category. The two fuzzy expert systems perform the analysis and decision-making functions in the network. Details of each expert system can be found in [2,5].
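As a rough sketch of the merge step, the fragment below treats preliminary clusters as nodes with a pairwise fuzzy relation and greedily joins any pair whose relation exceeds a cutoff, producing the final speaker list. The relation (cosine similarity mapped onto [0, 1]) and the 0.7 cutoff are illustrative assumptions, not the actual rules of the expert systems in [2,5].

```python
import numpy as np

def fuzzy_merge(prototypes, merge_cutoff=0.7):
    """Greedy merge of preliminary speaker classes: any prototype
    whose fuzzy relation to an existing group exceeds the cutoff is
    absorbed into that group; otherwise it starts a new group."""
    merged = []
    for p in prototypes:
        p = np.asarray(p, dtype=float)
        p = p / np.linalg.norm(p)
        for group in merged:
            rel = 0.5 * (1.0 + float(group["proto"] @ p))  # map cosine [-1,1] -> [0,1]
            if rel > merge_cutoff:
                n = group["n"]
                proto = (group["proto"] * n + p) / (n + 1)  # running-mean prototype
                group["proto"] = proto / np.linalg.norm(proto)
                group["n"] = n + 1
                break
        else:
            merged.append({"proto": p, "n": 1})
    return merged
```

In this sketch, an Encode decision for a new glob would correspond to its prototype clearing the cutoff against some group; a NoEncode decision would leave it out of every existing category.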

TESTING
A series of system tests was performed using four speakers, with both text-dependent and text-independent speech.

A. TEST CASES
Various test cases were run with different data sets from the same four speakers to determine the robustness of the system by varying several basic parameters.
The input voice data were interleaved for maximum variability, so there were never two speech samples (globs) from the same speaker occurring sequentially.
The effect on overall classification performance was measured, as well as extraneous class creation.
The test data varied were: A) text-independent vs. text-dependent speech; B) neural-net vigilance parameters; C) "chunk" and "glob" times.
From the test data variations, the following results were determined: A) cumulative percent correct speaker classification; B) length of speech necessary for a certain percent correct classification; C) ratio of correct to incorrect classifications per speaker class created, which checks the system's average performance per created class.
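Metric A) above can be computed as a running score over the sequence of classified globs; the label sequences in the usage below are made-up illustrations, not test data from the paper.

```python
def cumulative_percent_correct(true_labels, predicted_labels):
    """Running percent-correct after each classified glob."""
    correct = 0
    curve = []
    for i, (t, p) in enumerate(zip(true_labels, predicted_labels), start=1):
        correct += (t == p)                 # tally correct classifications so far
        curve.append(100.0 * correct / i)   # percent correct through glob i
    return curve

# Illustrative usage with invented labels:
# cumulative_percent_correct([0, 1, 0, 1], [0, 1, 1, 1])
# yields the running curve [100.0, 100.0, 66.67, 75.0]
```

Metric B) then falls out of the same curve: it is the earliest point (converted to seconds of voiced speech) at which the running score stays at or above the target percentage.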
B. The minimum length of voiced speech required for 60% correct classification is about 4 seconds, at a vigilance of 0.85.

TEST RESULTS
Note that the effects of varying vigilance were also tested and shown to be minimal over the range 0.85 to 0.99. This, of course, depends on the signal energy, which should be normalized.
A series of bar charts summarizing the overall effects of specific parameters was also generated.
In a representative bar chart below, text-independent (black) vs. text-dependent (speckled) speech were compared, showing cumulative percent correct.
The results again indicate a minimum of about 60% correct for varied glob times and a fixed vigilance factor and chunk time.
The results of C) were measured by segmenting test results into two sets: one for the correct speaker class and a second for the "other" class.
For an ideal system, the "other" class would always have no members.
This test was also run while varying system parameters.
Results similarly obtained indicated that the correct speaker set dominated the "other" set by 50% or more.
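The correct-versus-"other" comparison can be expressed as a per-class tally: for each created speaker class, count how many of its members belong to the dominant (correct) speaker and how many fall into the "other" bucket. The class assignments in the usage below are invented examples for the sketch.

```python
from collections import Counter

def class_purity(assignments):
    """For each created speaker class (mapping class -> list of true
    speaker labels of its members), split members into the dominant
    (correct) speaker and an 'other' bucket, and report the margin
    by which the correct set dominates."""
    report = {}
    for cls, members in assignments.items():
        counts = Counter(members)
        speaker, correct = counts.most_common(1)[0]   # dominant speaker in class
        other = len(members) - correct                # everyone else: "other"
        margin = 100.0 * (correct - other) / len(members)
        report[cls] = {"speaker": speaker, "correct": correct,
                       "other": other, "margin_pct": margin}
    return report

# Illustrative usage: a class with members A, A, A, B has the correct
# set dominating the "other" set by a 50% margin.
```

A margin of at least 50% for every created class corresponds to the behavior reported above; an ideal system would show a 100% margin (an empty "other" set) throughout.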