Modeling of Speech Localization in a Multi-talker Mixture Using Periodicity and Energy-based Auditory Features

A recent study showed that human listeners are able to localize a short speech target simultaneously masked by four speech tokens in reverberation [Kopčo, Best, and Carlile (2010)]. Here, an auditory model for solving this task is introduced. The model has three processing stages: (1) extraction of the instantaneous interaural time difference (ITD) information, (2) selection of target-related ITD information ("glimpses") using a template-matching procedure based on periodicity, spectral energy, or both, and (3) target location estimation. The model performance was compared to the human data, and to the performance of a modified model using an ideal binary mask (IBM) at stage (2). The IBM-based model performed similarly to the subjects, indicating that the binaural model is able to accurately estimate source locations. Template matching using spectral energy and using a combination of spectral energy and periodicity achieved good results, while using periodicity alone led to poor results. In particular, the glimpses extracted from the initial portion of the signal were critical for good performance. Simulation data show that the auditory features investigated here are sufficient to explain human performance in this challenging listening condition and thus may be used in models of auditory scene analysis.


I. INTRODUCTION
Human listeners are able to attend to and understand one specific talker in complex acoustic settings, such as reverberant rooms in which multiple talkers speak at the same time (e.g., Bronkhorst, 2000). One aspect of this ability is the localization of the attended target talker in a multi-talker environment. How the monaural signal-related auditory features, used for the discrimination of the target against the maskers, are combined with binaural features to identify the location of the target is still largely unknown (e.g., see Shamma and Fritz, 2014). This study examines this question by simulating a localization task in a multi-talker setting using an auditory model, and comparing it to human data. Periodicity and spectral energy were investigated as monaural features; interaural time differences were used as binaural features. One important characteristic of the proposed model is that it uses a priori information about the target speech token, similar to the optimal detector approaches established in psychoacoustic detection models (e.g., Dau et al., 1996). This way, information about the target is used optimally, making it possible to assess the relative salience of the features and their interaction in solving the task, which is the main purpose of this study.
The ability of human listeners to localize speech in complex listening scenarios depends on a number of factors. It has been shown that frontal azimuth localization performance degrades with decreasing signal-to-noise ratio (SNR; Kopčo et al., 2010), increasing number of maskers (Langendijk et al., 2001), masker uncertainty (Kopčo et al., 2010), and reverberation (Giguère and Abel, 1993). Experimental results suggest that, primarily, the binaural features at the onset of a sound (Houtgast and Aoki, 1994; Freyman et al., 1997), or at rising segments of the signal envelope (Dietz et al., 2013), are used for localization.
Auditory modeling of frontal azimuthal localization has been done using physiologically inspired models based on normalized cross-correlation (Faller and Merimaa, 2004; Roman et al., 2003) or on the extraction of instantaneous interaural phase differences (IPDs; Dietz et al., 2011). In some cases, these models contain a measure for identifying robust binaural information: Only binaural information with a high interaural correlation (Faller and Merimaa, 2004), or a high interaural vector strength (IVS; Dietz et al., 2011), is taken into account to estimate locations. Using these measures is especially important for scenarios that include reverberation and multiple sound sources. It has been shown that these models can accurately estimate the locations of multiple talkers. However, the models alone are not able to determine which of the segregated sources is the target and which are the maskers. Furthermore, the models' performance was not previously compared to human data.
To identify a speech target in a multi-talker mixture, further features are needed. It has been shown that periodicity is an important cue for distinguishing between different talkers (e.g., Darwin, 1981; Alain et al., 2005). Another important cue is the spectral profile (Gockel and Colonius, 1997; Gockel, 1998).
In a typical auditory scene analysis task, several features need to be integrated into distinct auditory objects. One principle that guides this integration is that temporally and spectrally coherent features are bound to the same object (Elhilali et al., 2009; Shamma et al., 2011; Teki et al., 2013).
The present study introduces an auditory model that simulates the task of localizing a female speech token presented simultaneously with four male speech tokens arranged in different spatial configurations (Kopčo et al., 2010). The scene is complex in the sense that it has a high number of maskers, a relatively low SNR of −6 dB, a simultaneous onset of target and maskers, a complete temporal overlap of the target word by the masker words, short utterances (mostly < 300 ms), a slightly reverberant environment, unknown masker words, and an unknown spatial masker configuration.
This study investigated three different aspects of the simulated localization task. First, it examined whether an auditory binaural model (Dietz et al., 2011) is suitable for modeling human localization performance in this challenging condition when optimal selection of target-related binaural information is assumed. For the optimal selection of target-related features, an ideal binary mask (IBM) was used (Wang, 2005; Barker and Cooke, 2007). Second, it investigated how target-related features can be selected using a priori knowledge about the unmasked target utterance. For this, we adopted the optimal detector method developed to predict human detection performance (e.g., Dau et al., 1996). In particular, a template was generated that consisted of the extracted monaural features of the unmasked target. Then, a template-matching procedure compared the template with the respective features from the multi-talker input signal and selected the matching time-frequency bins. Under the assumption that auditory features occurring in the same time-frequency bin belong to the same source (Shamma et al., 2011), the target-related binaural information was read out from the selected bins, whereas the binaural information from the remaining bins was assigned to the maskers. As monaural features, periodicity, spectral energy, and a combination of both features were compared to investigate the relative salience of these features. Finally, the present study investigated the importance of early vs late portions of the signal in the localization task. This was done by analyzing the localization accuracy of the model for the early and late signal portions, and by manipulating the selection of target-related information using mixtures of template-matching procedures and optimal IBM-based selection.

II. MODEL DESCRIPTION
Figure 1 shows the outline of the model. The input signal was a multi-talker signal, as used by Kopčo et al. (2010). The task was to estimate the location of the target (the word "two" uttered by a female talker) presented simultaneously with four different male masker speech tokens originating from four different locations. A detailed description of the stimuli is given in Sec. III A. First, the left and right channels of the input signal were preprocessed by a model of the auditory periphery, and auditory features were extracted from the preprocessed signals (Fig. 1, model part A). Binaural features were calculated using a slightly modified version of the binaural model of Dietz et al. (2011); monaural features were periodicity (Chen and Hohmann, 2015) and spectral energy. Second, target-related binaural features were selected using a binary mask (BM) (Fig. 1, model part B). This BM could either be an IBM (which replaces the stages in the dashed-dotted box), derived by analyzing the target and masker signals separately, or a BM based on a template-matching procedure that compared the monaural features derived from the target alone with those derived from the target and masker mixture signal. Third, the final target location was estimated based on the distributions of selected and not-selected binaural features across the whole utterance (Fig. 1, model part C). It is important to note here that the binaural and periodicity features were pre-selected according to a robustness measure. It is thus assumed that each pre-selected feature value mainly represents a single sound source and that a binary decision as implemented here is sufficient to separate target- and background-related feature values. A detailed description of the three model parts is given in the following.

A. Feature extraction

Auditory preprocessing
The left and right multi-talker input signals passed a preprocessing stage based on a model of the auditory periphery as used by Dietz et al. (2011). In brief, this model includes a middle ear band-pass filter, a gammatone filter bank with 23 filters ranging from approximately $f_c = 200$ Hz to $f_c = 5$ kHz, followed by an instantaneous compression, half-wave rectification, and a low-pass filter. In addition to the original model by Dietz et al. (2011), a differentiator was implemented after the low-pass filter to remove the DC component before extracting periodicity features as described in Sec. II A 3. The signals then passed a fine structure filter (for $f_c < 1400$ Hz) or envelope filter (for $f_c > 1400$ Hz). In line with the original model, the fine structure filters were set to the respective center frequency $f_c$ of each filter and a bandwidth of $f_c/3$. The envelope filter was set to a center frequency of $f_m = 250$ Hz and a bandwidth of 250 Hz to cover the full range of the target talker fundamental frequency (ca. 170-270 Hz).

Binaural features
Binaural features were extracted as described by Dietz et al. (2011). This model computes the IPDs as a function of time $t$ in each frequency band $f_c$. ITDs are calculated from IPDs and the sub-band instantaneous frequency. Interaural level differences (ILDs) are extracted from the preprocessed signals before the differentiation stage; the sign of the ILD is used to resolve IPD ambiguities in the fine structure filters. ITDs are low-pass filtered using a time constant $\tau$, which defines the binaural temporal resolution. As a measure for the robustness of the binaural features, the IVS was calculated. Only those binaural feature values are further processed whose corresponding IVS values exceed a threshold $\mathrm{IVS}_0$. A second measure for robustness is the "rising flanks" criterion. That is, only those features are further processed where the derivative of the IVS time signal is positive. As a binaural time constant, we chose $\tau = 1/f_c$ for the fine structure channels and $\tau = 1/f_m$ for the envelope channels; as a threshold for robust information, we chose $\mathrm{IVS}_0 = 0.9$. In the original Dietz et al. (2011) study, these parameters were set to $\tau = 5/f_c$ and $\tau = 5/f_m$, respectively, and $\mathrm{IVS}_0 = 0.98$. The parameters were changed in this study to achieve a sufficiently high number of robust features during the short duration of the target utterance.
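The two robustness criteria can be illustrated with a minimal sketch for a single frequency band. The function name `select_robust`, the use of `None` for discarded samples, and the first difference as an estimate of the IVS derivative are assumptions made for illustration; they are not taken from the Dietz et al. (2011) implementation.

```python
from typing import List, Optional


def select_robust(itd: List[float], ivs: List[float],
                  ivs_threshold: float = 0.9) -> List[Optional[float]]:
    """Keep ITD samples whose IVS exceeds the threshold while the IVS is rising.

    Discarded samples are marked with None (an illustrative convention).
    """
    selected: List[Optional[float]] = [None] * len(itd)
    for n in range(1, len(itd)):
        # Assumption: the derivative of the IVS is approximated by a first difference.
        rising = ivs[n] - ivs[n - 1] > 0.0
        if ivs[n] > ivs_threshold and rising:
            selected[n] = itd[n]
    return selected


# Toy values (made up): the fourth sample has a high but falling IVS.
itd = [0.1, 0.2, 0.3, 0.4, 0.5]
ivs = [0.80, 0.92, 0.95, 0.93, 0.99]
print(select_robust(itd, ivs))  # [None, 0.2, 0.3, None, 0.5]
```

The first sample is always discarded in this sketch because no difference can be formed for it.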
ITDs were mapped to azimuth angles $\alpha_1(t, f_c)$ using a fitting function that was calculated similarly to the one described by Dietz et al. (2011): First, we generated speech signals based on the speech corpus employed (Kidd et al., 2008); each signal consisted of one random word uttered by one random talker of the experiment. This utterance was convolved with the binaural room impulse response (BRIR) for a specific direction ranging from −60° to 60° in 10° steps; the same BRIRs were used to generate the input signals in the simulations (see Sec. III A). The final 0.8 s of each signal were discarded because they tended to be dominated by reverberant energy. Second, we extracted ITDs and ILDs of these signals, as described earlier; as parameters, $\tau = 2.5/f_c$ or $\tau = 2.5/f_m$ and $\mathrm{IVS}_0 = 0.98$ were chosen. Third, we calculated one ITD for each azimuth direction $\alpha$ as the median of the ITDs across time. For each azimuth, 25 iterations with random words and random talkers were done and the ITD was found as the median across these iterations, resulting in values $\mathrm{ITD}(\alpha)$ for each $f_c$. Fourth, a linear fitting function was applied to the inverse values $\alpha(\mathrm{ITD})$ for each $f_c$. The parameters $\tau$ and $\mathrm{IVS}_0$ for the calculation of the lookup table were chosen to select robust binaural information for the single-source reference signal and differed from the parameters used for the extraction of binaural features in the simulations. The target localization in quiet (cf. Fig. 2), however, was not influenced by the change in the parameter values.
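The third and fourth steps (median ITD per azimuth, then a linear fit of azimuth against ITD) can be sketched for one frequency band as follows. This is a hedged illustration using ordinary least squares; the function name `fit_azimuth_map` and the toy ITD values are hypothetical, and the actual fitting routine used in the study is not specified in the text.

```python
from statistics import median
from typing import Dict, List, Tuple


def fit_azimuth_map(itd_samples: Dict[float, List[float]]) -> Tuple[float, float]:
    """Fit azimuth = slope * ITD + intercept to the per-azimuth median ITDs.

    itd_samples maps each azimuth (degrees) to the ITDs (seconds) obtained
    across iterations for one frequency band.
    """
    azimuths = sorted(itd_samples)
    itds = [median(itd_samples[a]) for a in azimuths]
    n = len(azimuths)
    mean_itd = sum(itds) / n
    mean_az = sum(azimuths) / n
    # Ordinary least squares with ITD as the independent variable.
    sxx = sum((x - mean_itd) ** 2 for x in itds)
    sxy = sum((x - mean_itd) * (a - mean_az) for x, a in zip(itds, azimuths))
    slope = sxy / sxx
    return slope, mean_az - slope * mean_itd


# Made-up ITDs for three azimuths (not measured data).
toy = {
    -60.0: [-6.1e-4, -6.0e-4, -5.9e-4],
    0.0: [-1.0e-5, 0.0, 1.0e-5],
    60.0: [5.9e-4, 6.0e-4, 6.1e-4],
}
slope, intercept = fit_azimuth_map(toy)
print(slope, intercept)
```

In the model, one such pair of fit parameters would be needed per center frequency $f_c$.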
Azimuth signals $\alpha_1(t, f_c)$ were then downsampled from $f_{s1} = 44.1$ kHz to $f_s = 1$ kHz in order to reduce storage usage and to provide better temporal alignment with the periodicity and spectral energy features, both of which were extracted with a sampling frequency of 1 kHz. The downsampling algorithm calculated the mean value of the IVS-selected binaural information every 1 ms. The resulting signal is referred to as $\alpha(t, f_c)$.
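A minimal sketch of this downsampling step, assuming that discarded (not IVS-selected) samples are marked with `None` and that each input sample is assigned to the 1-ms frame that contains it. The function name and the handling of frames without any selected sample are illustrative choices, not details given in the text.

```python
from typing import List, Optional


def downsample_selected(values: List[Optional[float]],
                        fs_in: int = 44100,
                        fs_out: int = 1000) -> List[Optional[float]]:
    """Average the selected (non-None) samples within each output frame."""
    n_out = -(-len(values) * fs_out // fs_in)  # ceiling division
    sums = [0.0] * n_out
    counts = [0] * n_out
    for n, v in enumerate(values):
        if v is not None:
            k = n * fs_out // fs_in  # index of the output frame containing sample n
            sums[k] += v
            counts[k] += 1
    # Frames without any selected sample stay empty (None).
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy example with a 2:1 rate ratio so the frames are easy to follow.
print(downsample_selected([1.0, None, 3.0, 5.0, None, None], fs_in=4, fs_out=2))
# [1.0, 4.0, None]
```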

Periodicity features
Periodicity features were extracted from the preprocessed signals. They were based on the extraction of the normalized "synchrogram" $S(t, f_c, P)$ (Chen and Hohmann, 2015). The normalized synchrogram $S(t, f_c, P)$ is the ratio of the harmonic signal energy for the period $P$ and the total signal energy in the same time window for a $[t, f_c]$ bin, computed for a number of tested candidate periods $P'$. If $S(t, f_c, P) = 1$, the signal is fully harmonic with a period $P$; if $S(t, f_c, P) = 0$, there is no harmonic energy at the period $P$. It was therefore assumed that the locations of local maxima with high peak values of the synchrogram function across candidate periods $P'$ correspond to the dominating fundamental period $P0$ and its multiples. The extraction of periodicity features is explained in detail in the following.
The set of all local maxima of the synchrogram for one given $[t, f_c]$ bin was given by
$$P_{max}(t, f_c) = \{P \in P' \mid S(t, f_c, P) \text{ is a local maximum}\}. \tag{2}$$
A local maximum was defined as a value which is larger than its two neighboring values. The periodicity features were chosen from the set of local maxima $P_{max}(t, f_c)$ if they fulfilled certain energy requirements: The largest local maximum had to exceed a value of $P_1$, making sure that there is enough harmonic energy in the signal. If this requirement was fulfilled, all local maxima exceeding a certain threshold $P_2$ were chosen as periodicity features. $P_1$ and $P_2$ were set to 0.9 and 0.8, respectively, in fine structure bands ($f_c < 1400$ Hz) and to 0.5 and 0.4, respectively, in envelope bands ($f_c > 1400$ Hz). These values were chosen to make sure that periodicity features represent sub-band signal sections with a salient predominant periodicity, similar to the coherence-based selection of binaural features. The periodicity features were determined separately for the left and right channels; in the following, they are referred to as $P0_l(t, f_c)$ and $P0_r(t, f_c)$, respectively.
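The selection rule for one $[t, f_c]$ bin can be sketched as below. The synchrogram values are assumed to be given as a list sampled at the candidate periods; the function name `periodicity_features` and the toy numbers are hypothetical, and the computation of the synchrogram itself (Chen and Hohmann, 2015) is not reproduced here.

```python
from typing import List


def periodicity_features(periods: List[float], synchrogram: List[float],
                         p1: float, p2: float) -> List[float]:
    """Return the periods at local maxima above p2, provided the largest maximum exceeds p1."""
    # A local maximum is larger than both of its neighbours.
    maxima = [i for i in range(1, len(synchrogram) - 1)
              if synchrogram[i] > synchrogram[i - 1]
              and synchrogram[i] > synchrogram[i + 1]]
    if not maxima or max(synchrogram[i] for i in maxima) <= p1:
        return []  # not enough harmonic energy in this bin
    return [periods[i] for i in maxima if synchrogram[i] > p2]


# Toy fine structure band (P1 = 0.9, P2 = 0.8): maxima at 4 ms (0.95), 6 ms (0.6), 8 ms (0.85).
periods_ms = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
s_values = [0.2, 0.95, 0.3, 0.6, 0.4, 0.85, 0.2]
print(periodicity_features(periods_ms, s_values, p1=0.9, p2=0.8))  # [4.0, 8.0]
```

The maximum at 6 ms is rejected because it does not exceed $P_2$, whereas the one at 8 ms, a multiple of the 4-ms period, is kept.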

Spectral energy features
Spectral energy features $E_l(t, f_c)$ and $E_r(t, f_c)$ were calculated from the preprocessed signals. They were calculated every 1 ms as the mean signal power in a 10-ms rectangular moving window.
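As an illustration, the energy feature of one sub-band signal can be computed as in the following sketch. The window alignment (each window starts at the frame time) and the dropping of incomplete windows at the end are assumptions, since the text only specifies the window length and the 1-ms step.

```python
from typing import List


def spectral_energy(signal: List[float], fs: int,
                    window_ms: float = 10.0, hop_ms: float = 1.0) -> List[float]:
    """Mean signal power in a rectangular moving window, evaluated every hop_ms."""
    win = max(1, int(round(window_ms * 1e-3 * fs)))
    hop = max(1, int(round(hop_ms * 1e-3 * fs)))
    # Assumption: windows start at the frame time; incomplete windows are dropped.
    return [sum(x * x for x in signal[start:start + win]) / win
            for start in range(0, len(signal) - win + 1, hop)]


# A constant signal of amplitude 2 has a mean power of 4 in every window.
energy = spectral_energy([2.0] * 20, fs=1000)
print(len(energy), energy[0])
```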

B. Selection of target-related binaural information
The binaural signal $\alpha(t, f_c)$ contains azimuth information of target and masker stimuli. The selection of target-related information was based on BMs that label the target-dominant $[t, f_c]$ bins, or "glimpses," with a value of 1 and all other bins with 0. The selection mechanism was restricted to the initial portion of the stimulus ($t \in [0, 300]$ ms), which contained direct-sound energy of the target. In this study, BMs were estimated in several different ways: first, as an IBM; second, via a template-matching procedure of periodicity features ($\mathrm{BM}_{P0}$), spectral energy features ($\mathrm{BM}_E$), or a combination of both ($\mathrm{BM}_{E,P0}$); third, using a combination of IBM for the early signal portions and BM for the late signal portions, or vice versa. Each of these procedures is explained in the following.

IBM
The IBM is defined as a function of $\mathrm{SNR}(t, f_c)$, the ratio of target signal energy to masker signal energy; these energies were calculated analogously to the multi-talker energy described in Sec. II A 4. For this approach, full a priori knowledge about the separated target and masker signals is needed.
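A minimal sketch of an IBM of this kind is given below, assuming that a bin is labeled 1 when its local SNR in dB exceeds a threshold. The threshold value is not given in the text above, so `threshold_db` is a hypothetical parameter that the caller must supply; the treatment of bins with zero energy is likewise an illustrative choice.

```python
import math
from typing import List


def ideal_binary_mask(target_energy: List[List[float]],
                      masker_energy: List[List[float]],
                      threshold_db: float) -> List[List[int]]:
    """Label each [t, f_c] bin with 1 if the local SNR exceeds threshold_db, else 0.

    threshold_db is a hypothetical parameter; its value is not stated in the text.
    """
    mask = []
    for row_t, row_m in zip(target_energy, masker_energy):
        row = []
        for e_t, e_m in zip(row_t, row_m):
            if e_t <= 0.0:
                row.append(0)      # no target energy: never target-dominant
            elif e_m <= 0.0:
                row.append(1)      # target energy without masker energy
            else:
                snr_db = 10.0 * math.log10(e_t / e_m)
                row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask


# Toy energies (rows: frequency bands, columns: time frames).
target = [[4.0, 1.0], [0.0, 2.0]]
masker = [[1.0, 4.0], [1.0, 0.0]]
print(ideal_binary_mask(target, masker, threshold_db=0.0))
```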

Template matching
The selection of target-related binaural information was based on a template-matching procedure that used the monaural features of the target alone as a template. This is in line with the Kopčo et al. (2010) experiment, in which the subjects had the opportunity to create a template, as the experiment included a target-alone control condition prior to the main experiments (see also simulation A in Secs. III and IV). To create the BMs $\mathrm{BM}_{P0}$, $\mathrm{BM}_E$, and $\mathrm{BM}_{E,P0}$ in each simulated experimental trial, the template of the target's periodicity and/or spectral energy was matched with the corresponding features extracted from the multi-talker mixture. The derivation of the templates and the computation of the BMs are described in detail in the following.
a. Periodicity template matching. To calculate the periodicity template $P0_{tar}(t, f_c)$, periodicity features were first extracted as described in Sec. II A 3 for all possible unmasked target utterances (11 locations with 2 channels each), referred to as the sets $P0_{tar,i}(t, f_c)$, $i = 1, \ldots, 22$. Second, a probability density function (PDF) $\mathrm{PDF}(t, f_c, P)$ across all sets $P0_{tar,i}(t, f_c)$ was calculated from Gaussian functions $N(\mu, \sigma)$ with an expected value $\mu$ and standard deviation $\sigma$, scaled by a factor $C$ that was chosen so that the integral of the PDF was one. The resulting PDF was usually a multi-peak function with peaks at multiples of the fundamental period. Third, the peak positions of the PDF were chosen as the possible candidates for the template. A candidate contributed to the template $P0_{tar}(t, f_c)$ if a minimum number of period values from the original sets $P0_{tar,i}(t, f_c)$ lay within $\pm 10^{-4}$ s of the candidate. These minimum numbers were set to 12 for the fine structure filters and 6 for the modulation filters.
In the template-matching procedure, a given multi-talker input's periodicity, $P0(t, f_c)$, was evaluated against the periodicity template, $P0_{tar}(t, f_c)$, separately for each $[t, f_c]$ bin. Two criteria had to be fulfilled for the input at each ear to consider it a match to the template: (1) the number of periodicity values had to be similar between $P0(t, f_c)$ and $P0_{tar}(t, f_c)$, and (2) the periodicity values found in the input had to be similar to the periodicity values in the template. Specifically, the two criteria were defined as follows.
Criterion 1: The difference of the number of periodicity values in one $[t, f_c]$ bin should not exceed a threshold of 2, i.e., $|\#P0(t, f_c) - \#P0_{tar}(t, f_c)| \le 2$. The symbol $\#$ denotes the number of elements in a set. The rule was applied to the left and right channel periodicity features separately.
Criterion 2: This criterion had two versions depending on whether there were fewer values in the multi-talker input or in the template. If the number of values in the multi-talker input was lower than in the template, then for each periodicity value in the multi-talker input there had to be a value in the template that did not differ by more than 0.1 ms. If the number of values in the template was lower than in the multi-talker input, then for each periodicity value in the template there had to be a value in the multi-talker input that did not differ by more than 0.1 ms. The second rule was also implemented for the left and right channel features individually; the corresponding variables are termed $B_l(t, f_c)$ and $B_r(t, f_c)$. The $\mathrm{BM}_{P0}$ was estimated on the basis of the aforementioned rules, which had to apply for both the left and the right channel.
b. Spectral energy template matching. The spectral energy template was calculated as the mean of all spectral energy features of the 22 unmasked target utterances. BM estimation based on energy template matching was based on the absolute difference between the target template and the left and right multi-talker signals.
c. Combination of periodicity and spectral energy. The BM for the combination of periodicity and spectral energy features was calculated as the product
$$\mathrm{BM}_{E,P0}(t, f_c) = \mathrm{BM}_E(t, f_c) \cdot \mathrm{BM}_{P0}(t, f_c).$$
That means that $\mathrm{BM}_{E,P0}$ is one only in the $[t, f_c]$ bins in which both the periodicity features and the spectral energy features matched the template.
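The two periodicity criteria and the combination rule can be sketched for a single $[t, f_c]$ bin as follows. Several details are assumptions made for illustration: bins with an empty input or template set are treated as non-matching, sets of equal size are handled like the case of fewer input values, and the energy mask value is taken as a given input because the energy matching threshold is not stated in the text. All function names are hypothetical.

```python
from typing import Sequence

TOLERANCE_MS = 0.1   # maximum period deviation (Criterion 2)
MAX_COUNT_DIFF = 2   # maximum difference in the number of values (Criterion 1)


def matches_template(inp: Sequence[float], tpl: Sequence[float]) -> bool:
    """Check both periodicity criteria for one ear in one [t, f_c] bin (periods in ms)."""
    if not inp or not tpl:
        return False  # assumption: bins without periodicity values never match
    if abs(len(inp) - len(tpl)) > MAX_COUNT_DIFF:
        return False  # Criterion 1 violated
    # Criterion 2: every value of the smaller set needs a close partner in the larger set.
    smaller, larger = (inp, tpl) if len(inp) <= len(tpl) else (tpl, inp)
    return all(any(abs(p - q) <= TOLERANCE_MS for q in larger) for p in smaller)


def bm_p0_bin(left: Sequence[float], right: Sequence[float],
              tpl: Sequence[float]) -> int:
    """BM_P0 for one bin: both ears have to match the template."""
    return int(matches_template(left, tpl) and matches_template(right, tpl))


def bm_combined_bin(bm_e: int, bm_p0: int) -> int:
    """BM_{E,P0} for one bin: product of the energy and periodicity masks."""
    return bm_e * bm_p0


template = [4.0, 8.0]  # toy template: fundamental period and its first multiple
print(bm_p0_bin([4.05], [3.98, 8.02], template))
print(bm_combined_bin(1, bm_p0_bin([5.0], [4.0], template)))
```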

BMs based on early vs late signal portions
To examine how different temporal portions of the signals contribute to the BMs, an additional analysis was performed in which BMs of the early portion of the signal ($t \le 100$ ms) were treated separately from the BMs of the late portion of the signal ($t > 100$ ms). These BMs are referred to as $\mathrm{BM}^{early}$ and $\mathrm{BM}^{late}$, respectively. Combinations of different BM types were denoted as additions of $\mathrm{BM}^{early}$ and $\mathrm{BM}^{late}$, e.g., the combination of IBM in the early portion and $\mathrm{BM}_{P0}$ in the late portion was termed $\mathrm{IBM}^{early} + \mathrm{BM}^{late}_{P0}$.
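Combining two masks in this way amounts to splicing them along the time axis. The sketch below assumes masks stored as lists of rows (one row per frequency band, one column per 1-ms frame, with frame $n$ at $t = n$ ms); this layout and the function name are illustrative assumptions.

```python
from typing import List


def combine_early_late(bm_early_source: List[List[int]],
                       bm_late_source: List[List[int]],
                       split_ms: int = 100,
                       frame_ms: int = 1) -> List[List[int]]:
    """Use one mask for t <= split_ms and another mask for t > split_ms."""
    n_early = split_ms // frame_ms + 1  # frames at t = 0, frame_ms, ..., split_ms
    return [list(early[:n_early]) + list(late[n_early:])
            for early, late in zip(bm_early_source, bm_late_source)]


# Toy masks with five frames and a split at 2 ms.
ibm = [[1, 1, 1, 1, 1]]
bm_p0 = [[0, 0, 0, 0, 0]]
print(combine_early_late(ibm, bm_p0, split_ms=2))  # early part from ibm, late part from bm_p0
```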

C. Estimation of target location
To estimate the target location, two PDFs of the location estimates were generated, one based on the selected bins, $\mathrm{PDF}_{sel}(\alpha)$, and one based on the not-selected bins, $\mathrm{PDF}_{nsel}(\alpha)$. The PDFs were generated by summing up Gaussian kernels centered at the selected or not-selected estimated locations at each $[t, f_c]$ bin, with normalization factors $C_1$ and $C_2$ chosen so that the PDF integrals were one. The target location estimate, $\hat{\alpha}$, was then defined as
$$\hat{\alpha} = \arg\max_{\alpha} \left( \beta \cdot \mathrm{PDF}_{sel}(\alpha) - \mathrm{PDF}_{nsel}(\alpha) \right). \tag{16}$$
The factor $\beta$ controls the relative influence of selected and not-selected azimuth values for the decision. On the basis of pilot experiments, we set $\beta = 3$ and the standard deviation of the Gaussian kernels to $\sigma = 30°$. This relatively large standard deviation was chosen because it generates smooth PDFs and thus leads to robust predictions. The subtraction of $\mathrm{PDF}_{nsel}(\alpha)$ suppresses the remaining masker-related information in the target-related PDF and resembles the active suppression of masker positions (Dong et al., 2013).
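The decision rule of Eq. (16) can be sketched as follows, with $\beta = 3$ and $\sigma = 30°$ as stated above. The azimuth grid from −90° to 90° in 1° steps, the discrete normalization on that grid, and the function names are assumptions made for illustration; the azimuth values in the example are made up.

```python
import math
from typing import List, Sequence


def kernel_pdf(estimates: Sequence[float], grid: Sequence[float],
               sigma: float) -> List[float]:
    """Sum of Gaussian kernels on a uniform grid, scaled to unit integral."""
    density = [sum(math.exp(-0.5 * ((a - e) / sigma) ** 2) for e in estimates)
               for a in grid]
    step = grid[1] - grid[0]
    total = sum(density) * step
    return [d / total for d in density] if total > 0.0 else [0.0] * len(grid)


def estimate_target(selected: Sequence[float], not_selected: Sequence[float],
                    grid: Sequence[float], beta: float = 3.0,
                    sigma: float = 30.0) -> float:
    """Azimuth that maximizes beta * PDF_sel - PDF_nsel on the grid."""
    pdf_sel = kernel_pdf(selected, grid, sigma)
    pdf_nsel = kernel_pdf(not_selected, grid, sigma)
    scores = [beta * s - n for s, n in zip(pdf_sel, pdf_nsel)]
    return grid[scores.index(max(scores))]


grid = list(range(-90, 91))  # assumed search grid in degrees
# Selected glimpses cluster around 30 degrees; not-selected values lie to the left.
print(estimate_target([28.0, 30.0, 32.0], [-40.0, -20.0, 0.0, 10.0], grid))
```

With an empty list of not-selected values, the rule reduces to the maximum of the selected-bin PDF, which corresponds to a situation without a selection mechanism.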

A. Stimuli
The speech material used here was the same as that used in Kopčo et al. (2010; speech corpus of Kidd et al., 2008). The target to be localized was a female voice uttering the word "two," which was kept constant throughout the experiment. The target azimuthal location was between −50° and 50° in 10° steps. The maskers were four male voices uttering a random monosyllabic word which completely overlapped the target word. Each target and masker utterance had approximately the same energy, so that the target-to-masker ratio was 0 dB, as stipulated by Kopčo et al. (2010). The resulting SNR was approximately −6 dB. The male talkers were the same throughout the experiment with the same left-to-right order. Five masker location patterns were used: [−50°, −40°, −30°, −20°], [20°, 30°, 40°, 50°], [−20°, −10°, 10°, 20°], [−50°, −40°, 40°, 50°], and [−40°, −10°, 10°, 40°].
The input signals were generated using virtual acoustics. Clean speech tokens were set to a root-mean-square (RMS) of 1 before convolution with a BRIR for the respective angle. BRIRs were measured in the ears of a human listener in a slightly reverberant room (Kopčo and Shinn-Cunningham, 2011). The distance between head and sound sources was 1 m and the azimuth spacing was 10°. All other methods for measuring BRIRs were the same as described by Shinn-Cunningham et al. (2005). In our study, we used only the BRIRs from the left hemisphere and switched left and right channels for the other hemisphere.

B. Simulations
Table I shows an overview of all of the simulations in this study. To assess the model performance for the localization of the unmasked target, a control condition (simulation A) was simulated in accordance with the psychoacoustic study of Kopčo et al. (2010). In this simulation, no selection mechanism was implemented, so that all extracted azimuth angles $\alpha(t, f_c)$ contributed to the estimated target location [cf. Sec. II C, Eq. (16)]. Consistent with the computations used for the masked localization simulations, the PDF was calculated based on Gaussian kernels with a standard deviation of $\sigma = 30°$. Only one model run was performed for each target location, because the target utterance was kept constant, in line with Kopčo et al. (2010). The model did not simulate any of the localization inaccuracies that presumably occur in the psychoacoustic experiment, e.g., due to the head tracking procedure.
Simulation B investigated the model with the selection of target-related binaural information based on the IBM (see Sec. II B 1).The IBM selection requires full a priori knowledge of the target and masker signals in isolation.The simulation can be seen as an investigation of the performance of the binaural model as well as the performance of the location estimation mechanism.
Simulation C investigated the model using BMs based on template matching using periodicity, spectral energy, or a combination of both features ($\mathrm{BM}_{P0}$, $\mathrm{BM}_E$, or $\mathrm{BM}_{E,P0}$; see Sec. II B 2).
Simulation D investigated how the model performance depends on information in the early (first 100 ms of the input) vs late portions of the signal (rest of the input). For this, the BMs from simulations B and C were combined such that either the IBM (from simulation B) was used for early signal portions and $\mathrm{BM}_{P0}$, $\mathrm{BM}_E$, or $\mathrm{BM}_{E,P0}$ (from simulation C) for late signal portions, or vice versa (see Sec. II B 3).
Furthermore, the BMs $\mathrm{BM}_{P0}$, $\mathrm{BM}_E$, and $\mathrm{BM}_{E,P0}$ estimated using these procedures were compared to the IBM in terms of positive predictive values (PPVs), negative predictive values (NPVs), accuracies (ACCs), and glimpse proportions (GPs). The PPV was defined as the total number of true positives, i.e., bins for which both the given BM and the IBM are one, divided by the number of bins with a value of one in the BM. Thus, it is a measure of how many of the selected glimpses are actually target-related, as defined by the IBM, which serves as the "gold standard." The NPV is defined as the total number of true negatives, i.e., bins for which both the given BM and the IBM are zero, divided by the total number of bins with a value of zero in the BM. Analogous to the PPV, the NPV is a measure of how many of the not-selected glimpses are actually not target-related, as defined by the IBM. The ACC is defined as the sum of true positives and true negatives divided by the total number of bins. The GP is defined as the number of ones in a BM divided by the total number of bins.
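The four measures follow directly from counting bins, as in this short sketch. The function name, the dictionary output, and the use of NaN when a denominator is zero are illustrative choices; the toy masks are made up.

```python
from typing import Dict, List


def mask_metrics(bm: List[List[int]], ibm: List[List[int]]) -> Dict[str, float]:
    """PPV, NPV, ACC, and GP of an estimated BM with the IBM as reference."""
    pairs = [(b, i) for row_b, row_i in zip(bm, ibm) for b, i in zip(row_b, row_i)]
    n_bins = len(pairs)
    true_pos = sum(1 for b, i in pairs if b == 1 and i == 1)
    true_neg = sum(1 for b, i in pairs if b == 0 and i == 0)
    ones = sum(b for b, _ in pairs)
    zeros = n_bins - ones
    nan = float("nan")  # returned when the BM contains no ones (PPV) or no zeros (NPV)
    return {
        "PPV": true_pos / ones if ones else nan,
        "NPV": true_neg / zeros if zeros else nan,
        "ACC": (true_pos + true_neg) / n_bins,
        "GP": ones / n_bins,
    }


bm = [[1, 1, 0, 0], [1, 0, 0, 0]]
ibm = [[1, 0, 0, 1], [1, 0, 0, 0]]
print(mask_metrics(bm, ibm))
```

In this toy case, two of the three selected bins are target-dominant according to the IBM (PPV = 2/3), and four of the five not-selected bins are correctly rejected (NPV = 0.8).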

C. Descriptive statistics
In the experiment of Kopčo et al. (2010), seven subjects $S_i$ participated, each of them performing ten runs per target position $\varphi$ (11 total) and masker pattern $p$ (5 total). Masker words and masker patterns were randomized across $(\varphi, p)$ conditions and subjects. For the model simulations, 50 runs were performed for each $\varphi$ and $p$ with randomized masker words.
For illustration and descriptive statistics, results for the spatially symmetric (masker) conditions, $p = 3$, 4, and 5, were merged across hemispheres. Furthermore, masker patterns $p = 1$ and $p = 2$ are spatially anti-symmetric, and their results were merged after mirroring the data of pattern 2. That is, for each target location the number of runs was doubled by adding the target location estimations of the respective mirrored location, so that 20 runs were examined instead of 10 (subjects) or 100 instead of 50 (model), except for $\varphi = 0°$ in masker patterns $p = 3$ through $p = 5$. The same merging was done for the control condition with the difference that the subjects performed 20 runs per target location so that 40 runs were examined for the mirrored data. This procedure reduced the influences of the sequence of maskers and room asymmetries on the results.
Model and subject data were compared with regard to the median bias and interquartile range (IQR) across runs within a $(\varphi, p)$ condition. The median bias is a measure of the deviation from perfect localization, referred to here as $\Delta_{S_i}(\varphi, p)$ for the subject $S_i$ and $\Delta_M(\varphi, p)$ for the model. The IQR was used as a measure for the variation across different runs for a given $(\varphi, p)$ condition, referred to as $\mathrm{IQR}_{S_i}(\varphi, p)$ for the subject $S_i$ and $\mathrm{IQR}_M(\varphi, p)$ for the model.
As a measure for the similarity between model and subject performance, global and local root-mean-square errors (RMSEs) were used. These RMSEs were always calculated with reference to the medians across individual $\Delta_{S_i}(\varphi, p)$ and $\mathrm{IQR}_{S_i}(\varphi, p)$, referred to as $\Delta_S(\varphi, p)$ and $\mathrm{IQR}_S(\varphi, p)$, respectively. The global bias RMSE was used to assess overall performance averaged across location and pattern. It was defined [Eq. (18)] as the root-mean-square of the differences between $\Delta_S(\varphi, p)$ and $\Delta_X(\varphi, p)$ across all $(\varphi, p)$ conditions, where $X = S_i$ for the subject and $X = M$ for the model. The local bias RMSEs were used to assess performance separately for each combination of pattern and target location. They were defined as
$$\mathrm{RMSE}_{\Delta,local,X}(\varphi, p) = |\Delta_S(\varphi, p) - \Delta_X(\varphi, p)|,$$
with the variable $X$ as used in Eq. (18). The calculation of global and local IQR RMSEs was done analogously.
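The following sketch computes local and global bias RMSEs for a set of conditions. It assumes that the global RMSE is the plain root-mean-square of the local deviations across all $(\varphi, p)$ conditions, which is an interpretation of the description above rather than a quoted formula; the function names and the bias values are made up.

```python
import math
from typing import Dict, Tuple

Condition = Tuple[int, int]  # (target azimuth in degrees, masker pattern index)


def local_rmse(reference: Dict[Condition, float],
               estimate: Dict[Condition, float]) -> Dict[Condition, float]:
    """Absolute deviation from the across-subject median for each condition."""
    return {cond: abs(reference[cond] - estimate[cond]) for cond in reference}


def global_rmse(reference: Dict[Condition, float],
                estimate: Dict[Condition, float]) -> float:
    """Root-mean-square of the local deviations across all conditions (assumed form)."""
    local = local_rmse(reference, estimate)
    return math.sqrt(sum(d * d for d in local.values()) / len(local))


# Made-up median biases in degrees for three target locations in pattern 1.
subject_median = {(-50, 1): 6.0, (0, 1): 0.0, (50, 1): -4.0}
model = {(-50, 1): 3.0, (0, 1): 0.0, (50, 1): 0.0}
print(local_rmse(subject_median, model))
print(global_rmse(subject_median, model))
```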
The reference measure for the comparison of model and subject results was the mean and standard deviation of the individual subjects' global and local RMSEs. The performance of the model was considered similar to human subject performance if the model RMSE lay within two standard deviations of the across-subject RMSE mean. For a rough statement about whether human and model performance were comparable or not, and how large the deviation was, the global RMSEs were used. To make a statement about the difference between human and model performance for individual $(\varphi, p)$ conditions, the local RMSEs were used.

A. Simulation of control condition
Model and subject median biases for the control condition are shown in Fig. 2. The results for the −50° to −10° locations were mirror-flipped and combined with the 50° to 10° location results. As the same target utterance was used for all runs at all locations, the model did not show any variance across runs. Hence, the results for model and subject IQRs across runs are not shown here. The model results were in good agreement with the subject results. This was reflected in the model's global RMSE of 3.5°, compared to the mean subject global RMSE of 3.6° ± 2.9°; that is, the global RMSE of the model lay within 0.06 subject standard deviations of their global RMSE. Analysis of the local RMSE revealed that the model was in good agreement with the subject data for all target locations, as can be seen in Fig. 2. At the 40° location, the model and subject median biases differed considerably. At this location, the mean local RMSE of the subjects was 4.68° ± 4.27° compared to a local RMSE of 8.24° for the model; due to the large variability across subject estimates, the model still fulfilled the criterion of not differing by more than two standard deviations from the mean subject RMSE.

B. Simulation using the IBM
Figure 3 shows the model and subject median biases and IQRs across runs for the masked localization data. The human bias data showed similar localization estimates across patterns (top row in Fig. 3). The main feature across the patterns was that the most lateral sources tended to be biased medially. This effect was strongest in masker pattern 1 for the lateral target locations near the distractors (50°) and weakest for masker pattern 1 for the target locations far from the distractors (−50°). The model captured the general trend reasonably well. However, it did not show the asymmetry between the data at $\varphi = -50°$ and $\varphi = 50°$ for masker pattern 1. For the IQRs, human data showed a similar behavior across masker patterns (bottom row in Fig. 3). In particular, the IQRs for lateral target positions tended to be larger than for medial target positions. In most cases, this trend was observable in the model data. For masker pattern 1, the IQR in the human data was higher than the model IQR for the target locations near the distractors and lower than the model IQR for the target locations far from the distractors. A similar trend, although weaker, can be seen for patterns 4 and 5. However, there the model predictions seem to be less stable. Thus, the model did not capture well the difference between human IQRs near vs far from the maskers. Generally, the model showed lower IQRs than the subjects, which is probably due to the fact that the model incorporated idealized knowledge about the target-dominant $[t, f_c]$ bins that the subjects did not have.
The global RMSEs for the median biases were 4:2 62:9 for the subjects and 3:5 for the model (see also Table II).For the IQRs the global RMSEs were 3:6 61:7 for the subjects and 2:5 for the model.That is, for both the biases and the IQRs, the global RMSEs of the model were within two standard deviations of the mean global RMSEs of the subjects.This indicates that the overall model predictions did not differ significantly from the subject data and that the

C. Simulations using template matching
Figure 4 shows the simulation results for the models based on template matching, using a layout similar to Fig. 3. The periodicity model showed a global RMSE of 26.2° for the biases and 12.9° for the IQRs (triangles in Fig. 4; see also Table II). Both of these model RMSEs lay outside two standard deviations of the mean global RMSE of the subjects. Particularly large differences were observed for the left hemisphere of masker pattern 1, in which the model responses showed a very strong bias toward the middle and even toward the masker positions. The IQRs were also very large for these conditions. However, there were similarities to both the subject biases and IQRs in terms of local RMSE in masker pattern 1 for the on-masker positions. In masker patterns 3 and 5, good performance was found for the center target positions (φ = 0°) in terms of both bias and IQR. This performance degraded for the more lateral positions, where the model estimates differed strongly from the subject biases and IQRs. For masker pattern 4, the bias estimates were close to the subject biases; however, with the exception of φ = 30°, the IQRs of the estimates were considerably higher than those observed in the subject data.
For the energy model (circles), the global RMSE was 6.3° for the biases and 6.9° for the IQRs. Both of these values lay inside two standard deviations of the mean global RMSEs of the subjects. For most (φ, p) conditions, the model was in good agreement with the subject results as analyzed with the local RMSEs. The model generally captured the trends of a medial localization bias and the increase in IQRs for lateral positions (masker patterns 3-5). These trends tended to be more distinct in the model than in the subject data. In masker pattern 1, the performance degraded strongly for positions far from the masker positions.
The combined model results (diamonds) had a global RMSE of 3.2° for the biases and 5.4° for the IQRs. Both of these values lay inside two standard deviations of the mean global RMSE of the subjects. The biases were in good agreement with the subject results for all target positions and all masker patterns. The IQRs generally seemed to be higher than the subject IQRs. Significant differences were found within the off-masker locations in masker pattern 1 and for some of the locations in masker patterns 3 and 5. Generally, the results for the combined model nearly approached the performance of the IBM model.

D. Influence of early vs late signal portions
Table II shows the global RMSEs of median bias and IQR for the model using different BMs and BM combinations as a selector for target-related binaural features. BMs were combined using the IBM in the early portions of the signal, and BM_P0, BM_E, or BM_E,P0 in the late portions of the signal; or vice versa (see Sec. II B 3).
Using only the early or only the late signal portions, selected with the IBM, generally increased the bias and IQR RMSEs compared to using the whole signal. This increase was stronger when using only the late signal portions, especially for the IQR. These findings suggest that an accurate selection is more important for the early portions of the signal than for the late portions. Still, the most accurate results were found when the complete IBM was used.
Results for the mixed BMs showed that replacing the selection in the early or late signal portions by an optimal selection lowered the RMSEs. This effect was strongest for the periodicity model. The ideal selection in early signal portions generally led to slightly lower bias and IQR RMSEs than the ideal selection in late signal portions.
Comparing the results for IBM^early with those for the combination of IBM^early and BM^late showed diverse results for the different features: While the combinations of IBM^early with BM_E^late and BM_E,P0^late led to a decrease in RMSE compared to IBM^early alone, the combination of IBM^early and BM_P0^late led to a relatively strong RMSE increase compared to IBM^early alone. These findings suggest that the selection in the late signal portions of BM_E and BM_E,P0 is similar to the contribution of IBM^late, while the selection in the late signal portions of BM_P0 possibly contains false positives and false negatives.
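The mixing of masks described in this section can be sketched as follows. This is an illustrative sketch only: the function name `mix_masks` and the list-of-rows mask layout (one row per frequency channel, one boolean per time frame) are assumptions, and the 100 ms boundary between early and late portions is taken from the caption of Fig. 5.

```python
def mix_masks(early_source, late_source, frame_times_ms, split_ms=100.0):
    """Combine two binary masks: early_source before split_ms, late_source after.

    Each mask is a list of rows (one per frequency channel); each row holds one
    boolean per time frame. frame_times_ms gives the time of each frame.
    """
    mixed = []
    for early_row, late_row in zip(early_source, late_source):
        mixed.append([
            early if t_ms < split_ms else late
            for early, late, t_ms in zip(early_row, late_row, frame_times_ms)
        ])
    return mixed


# Toy example with a single frequency channel and four time frames.
frame_times_ms = [0.0, 50.0, 100.0, 150.0]
ibm = [[True, False, True, False]]
bm_e = [[False, False, False, True]]

ibm_early_bm_late = mix_masks(ibm, bm_e, frame_times_ms)  # IBM early, BM_E late
bm_early_ibm_late = mix_masks(bm_e, ibm, frame_times_ms)  # BM_E early, IBM late
```

Swapping the first two arguments yields the "vice versa" combination, so the same helper covers both mixed conditions.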

E. Comparison of BMs
The results of simulation B showed that the model performance using the IBM as a selector for target-related binaural features was very similar to the subject results. Using the BMs based on template matching showed different results depending on the monaural features used. It is therefore of interest to compare those BMs to the IBM. Figure 5 shows BMs obtained with the four different approaches for one sample run, alongside the individual PPVs, NPVs, ACCs, and GPs. Table III shows the average measures across all conditions and runs. For the IBM (top left panel in Fig. 5), glimpses were found during the first 50 ms in almost all frequency bands. This was also reflected in a relatively high average GP of (20.3 ± 11.0)% within the early portion of the signal (0-100 ms; see Table III). There were also distinct glimpses observable during the late portions (>130 ms) in the modulation channels and the channels with f_c = 236 Hz, f_c = 414 Hz, and f_c = 488 Hz. There were no glimpses found after approximately 50 ms in the central frequency channels with f_c = 569 Hz to f_c = 1470 Hz. This pattern generalized to other sample runs as well, resulting in a much higher average GP during the early portions than during the late portions in the IBM model (Table III).
The GP for the energy model (BM_E) was similar to the GP of the IBM. The pattern of selected glimpses was also similar in the two models: both had a higher GP in the early than in the late portion, and both lacked glimpses in the late portions for frequency bands between f_c = 569 Hz and f_c = 1470 Hz. In contrast to the IBM and BM_E, BM_P0 was very sparse. The BM_E,P0 was the intersection of BM_P0 and BM_E, and was therefore also very sparse.
It is notable that the very few glimpses in the BM_E,P0 were very accurate estimates of the actual target-related glimpses defined by the IBM, as seen in a PPV of 67.1% ± 20.9%. Compared to that, the BM_P0 showed a relatively low PPV of 22.5% ± 12.7%, indicating that only a small part of the already few selected glimpses actually identified target-related bins. As seen in Fig. 5, there was a large number of mis-selections during the late portions for frequency channels from f_c = 569 Hz to f_c = 1296 Hz. This was reflected in a low PPV of BM_P0 in the late portions (16.2% ± 10.5%). Interestingly, the PPV of BM_E was also relatively low (33.3% ± 12.6%), although the overall congruence with the IBM seemed to be relatively high (see Fig. 5). These results revealed that the false selections arising from the periodicity and spectral energy features in isolation could be largely removed if the two features were combined.
While the PPVs differed between the BMs, their NPVs were relatively similar. NPVs generally had high values around 90%, meaning that the not-selected [t, f_c] bins were generally correctly identified as not target-related. The accuracy was also similar between the different BMs and showed a relatively high value. This may have been due to the dominance of the number of correct negatives in this measure.
Higher PPVs were found in the early portions of BM_P0 and BM_E, while PPVs were similar for the early and late portions of BM_E,P0. However, for all BMs, the NPV was approximately 10% higher in the early portions than in the late portions, which was also reflected in the accuracy.
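The mask measures compared in this section can be made concrete with a small sketch. It assumes the standard confusion-matrix definitions (PPV as the fraction of selected bins that are target-dominant in the IBM, NPV as the fraction of not-selected bins that are not target-dominant, ACC as the fraction of correctly classified bins) and takes GP to be the proportion of selected bins in a mask; the function names and the toy masks are hypothetical and do not reproduce the data of Fig. 5 or Table III.

```python
def mask_metrics(bm, ibm):
    """Compare an estimated binary mask with the IBM, bin by bin."""
    tp = fp = tn = fn = 0
    for bm_row, ibm_row in zip(bm, ibm):
        for selected, is_target in zip(bm_row, ibm_row):
            if selected and is_target:
                tp += 1  # correctly selected glimpse
            elif selected:
                fp += 1  # false positive: masker-dominated bin selected
            elif is_target:
                fn += 1  # false negative: target-dominated bin missed
            else:
                tn += 1  # correct negative
    total = tp + fp + tn + fn

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "ACC": ratio(tp + tn, total),
        "GP": ratio(tp + fp, total),
    }


def intersect_masks(mask_a, mask_b):
    """Bin-wise AND of two masks, as used for the combined mask BM_E,P0."""
    return [[a and b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


# Toy masks: two frequency channels, four time frames.
ibm = [[True, True, False, False], [False, False, False, False]]
bm_e = [[True, False, True, False], [False, True, False, False]]
bm_p0 = [[True, False, False, True], [False, False, False, False]]
bm_e_p0 = intersect_masks(bm_e, bm_p0)

metrics_e = mask_metrics(bm_e, ibm)
metrics_e_p0 = mask_metrics(bm_e_p0, ibm)
```

In this toy case the intersection is sparser than either input mask but has a higher PPV, which mirrors the qualitative pattern reported for BM_E,P0.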

V. DISCUSSION
The present study introduced an auditory model for localization of target speech in a complex acoustic environment. The model was evaluated on experimental data in which the target, a female voice uttering the word "two," was masked by four spatially separated male voices (Kopčo et al., 2010). Notable properties of the acoustic scene were a relatively low SNR, short utterances, a full temporal overlap of the target by the maskers, temporally aligned onsets, reverberation, and previously unknown masker words whose spatial configuration was also unknown. The model extracted ITD-based binaural features from the multi-talker scene, selected the target-related binaural features based on a BM, and estimated the target location by combining information from the selected and not-selected binaural features. As a selector, we evaluated the IBM and BMs based on a template-matching procedure using periodicity, spectral energy, and a combination of both features. Additionally, the contribution of BMs was examined separately in the early and late portions of the signals.
When the binaural feature selection was based on the IBM, the model performance was in good agreement with the subject performance. However, to create an IBM one needs to have the signal containing the maskers without the target, a requirement that is not fulfilled in regular localization tasks. The model performance using the template-matching BMs depended on the monaural features applied: Using periodicity, the overall model performance was worse than the subjects' performance. Using spectral energy, the performance was only slightly worse than the subject performance. Using both features combined led to subject-like performance in terms of bias and a slight performance degradation in terms of IQR. Replacing these BM-based selections by an optimal IBM-based selection in the early or late portions of the signal led to an improved model performance.

A. Differences between simulation design and experimental setup
Although the task was the same for the model and the subjects, there were some differences between the experiment and the simulation, which may have influenced the general comparability of model and subject performance.
First, the room used to record the BRIRs for the model simulations in this study was similar to, but not the same as, that used to collect the subject data. The differences in room geometry and in the materials of walls, ceiling, and floor may have caused some differences in reverberation and therefore a difference in performance. However, given the good match between the subject and IBM-model performance, this is unlikely to have been a major factor.
Another difference is that the model does not incorporate any individualization to account for differences in behavior between subjects. First, subjects may show some characteristic variability arising from the head-tracking procedure in the experiment. Applying an "internal noise" to the model location estimate would account for this variability; the present model version, however, does not do this. Second, each individual may have a characteristic response behavior, e.g., a certain azimuth offset. Our model did not account for these kinds of differences. This could be incorporated by modeling each subject individually.

B. Differences between model and subject performance
Several factors may have caused the observed differences between model and subject performance. (1) There may be a difference in how binaural information was extracted and combined, as the model used primarily ITDs while the subjects may have also based their localization on ILDs. (2) There may be differences in the selection of target-related time-frequency bins (assuming that the humans use such a selection at all). (3) There may be a mismatch between the model and the subjects in the binding process that links the binaural and monaural information related to the target and the maskers to estimate the target location. The data using the IBM as a selector for target-related binaural features showed absolute biases and IQRs comparable to or even lower than the subject data. We can thus assume that the binaural features (stage 1) and the location estimation procedure used here (stage 3) accurately simulate human performance. Thus, larger absolute biases and IQRs in specific models most likely occurred due to inaccuracies in the selection of target-related time-frequency bins. These inaccuracies can arise from both incorrect selection of masker-dominated bins as target bins (false positives) and omission of target bins (false negatives).
The combined BM yielded results close to the subject performance and the IBM, although it was very sparse and therefore missed many target-related time-frequency bins. On the other hand, the relative number of false positives (1 − PPV) was rather low. This finding suggests that misses are not necessarily a drawback, as long as the few selected bins are accurately estimated. As seen in the results for the periodicity BM, too many false positives can have a large negative effect on the model performance.
False positives occur whenever, by coincidence, the template and the multi-talker mixture differ in feature values by less than the chosen minimum difference threshold. For periodicity, false alarms were observed in the fine structure filters with center frequencies of approximately 600-1400 Hz. This may be due to an overlap of the high harmonics of target and masker signals. Therefore, voiced masker-dominated bins might easily be classified as target bins. For spectral energy, false positives occur whenever the mixture and the target template have a similar energy, while the target is not active in the mixture.
We attempted to reduce the influence of false positives on target location estimation by subtracting the PDF of not-selected binaural features from the PDF of selected binaural features before estimating the target location. Because the NPVs were generally high for all BMs, the estimation of the background is considered to be relatively accurate. However, especially for the periodicity model, this method was not sufficient to exclude the influence of false positives. One way to potentially improve the results would be to optimize the parameter b in Eq. (16), which determines the relative influence of the PDFs of selected and not-selected binaural features. In the present model, b was optimized for IBM results and was not changed for the other BMs. It is possible that location estimates that are remote from the masker locations would become more accurate if b were decreased; however, this would come at the cost of more inaccuracies for positions close to the maskers.
The influence of false positives and false negatives was especially prominent for masker patterns 1 and 2, in which all of the maskers were in one hemisphere, and the targets were in the other hemisphere (azimuths less than −10°). Here, masker-related binaural features were all in the range of 20° to 50°, so there was a large difference between them and the target positions. If the number of false positives and false negatives was high, the resulting PDF had a maximum either between target and masker positions or at the masker positions, resulting in large biases from the actual target position, and a wider possible spread of location estimates across runs. However, the subjects did not seem to have a problem localizing the target in these conditions.
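How such coincidental matches arise can be illustrated with a toy threshold rule. This is a hedged sketch rather than the model's actual template-matching stage: the function name, the use of a plain absolute difference, the energy values, and the threshold of 1.0 are all hypothetical choices made for illustration.

```python
def template_match_mask(template, mixture, threshold):
    """Select bins where the mixture feature lies within threshold of the template."""
    return [[abs(t - m) < threshold for t, m in zip(t_row, m_row)]
            for t_row, m_row in zip(template, mixture)]


# Hypothetical spectral-energy values for one channel and three time frames.
target_template = [[60.0, 40.0, 30.0]]  # target alone (a priori template)
mixture = [[60.5, 55.0, 30.8]]          # target plus maskers
reference_ibm = [[True, False, False]]  # assumed ground truth for this toy case

bm = template_match_mask(target_template, mixture, threshold=1.0)

# Bins selected by the template match that are not target-dominant in the IBM.
false_positives = [[sel and not ref for sel, ref in zip(bm_row, ibm_row)]
                   for bm_row, ibm_row in zip(bm, reference_ibm)]
```

In the third frame the mixture happens to have nearly the same energy as the template although the bin is masker-dominated in the assumed reference, so it is selected as a false positive.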
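The effect of subtracting the background PDF can be sketched with histograms. Eq. (16) is not reproduced in this section, so the form used here, a difference p_selected − b · p_not-selected followed by picking the maximum, is only an assumed stand-in; the function names, the azimuth grid, and the toy data are hypothetical.

```python
def normalized_histogram(values, bin_centers):
    """Histogram of values over the nearest bin centers, normalized to sum to 1."""
    counts = [0] * len(bin_centers)
    for v in values:
        nearest = min(range(len(bin_centers)), key=lambda i: abs(bin_centers[i] - v))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def estimate_location(selected, not_selected, bin_centers, b=1.0):
    """Assumed rule: maximize p_selected - b * p_not_selected over azimuth."""
    p_sel = normalized_histogram(selected, bin_centers)
    p_bg = normalized_histogram(not_selected, bin_centers)
    diff = [s - b * g for s, g in zip(p_sel, p_bg)]
    best = max(range(len(diff)), key=lambda i: diff[i])
    return bin_centers[best]


azimuths = [-50, -30, -10, 10, 30, 50]
# Selected bins: two true target cues at -30 plus three false positives at 30.
selected = [-30, -30, 30, 30, 30]
# Not-selected bins: dominated by maskers on the right side.
not_selected = [30, 30, 30, 50, 50, 10]

with_subtraction = estimate_location(selected, not_selected, azimuths, b=1.0)
without_subtraction = estimate_location(selected, not_selected, azimuths, b=0.0)
```

In this toy case the false positives dominate the selected histogram, so the estimate without subtraction falls on the masker side, while subtracting the background moves it back to the target azimuth. The weight b controls how strongly the background is discounted.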

C. Influence of early vs late portions of the signal
Several studies have shown that binaural information is primarily read out at the signal onset or at rising segments of the signal envelope (Houtgast and Aoki, 1994; Freyman et al., 1997; Dietz et al., 2013). The present simulations support these findings. In particular, in simulation D, IBM-based selection led to better results when used in the early signal portions than in the late signal portions. This implies that binaural features in the early portions of the signal are more accurate than in the late portions of the signal; this was expected, since reverberation has a smaller influence in the early portions.
In this experiment, all target and masker tokens started synchronously, so no onset features or temporal order features were available to segregate the talkers. Therefore, correct selection of target-related time-frequency bins was important, especially in the early signal portions when the binaural features are more reliable. Simulation D showed that the combination of IBM-based selection in the early portions and template-matching BM-based selection in the late portions can distinctly improve the results compared to using the template-matching BMs in the early and late portions. Furthermore, the analysis of BMs revealed a generally lower accuracy and NPV of BMs for the late than for the early signal portions: In the early portions, the proportion of misses (false negatives) of all not-selected [t, f_c] bins was higher. On the other hand, PPVs in the early portions were higher than or equal to PPVs in the late portions. These results show that the tested template-matching procedures are not able to bring out enough target-related binaural features available in the early portions of the signal.

VI. CONCLUSIONS
(1) The binaural model of Dietz et al. (2011) is capable of extracting a sufficient amount of ITD information to model localization of speech in a multi-talker masking speech mixture. Together with a location estimation back-end that is based on both target-related and background-related features, the model performance is comparable to the subject performance. However, this requires optimal selection of target-related "glimpses" in the time-frequency plane, e.g., using the IBM; the target localization cannot be achieved based on the binaural model alone. In addition, it requires a sophisticated method to separate the target-related glimpses from the masker-related ones.
(2) Segregation based on target-alone template matching, while more realistic than the IBM-based segregation, could not predict the human data as accurately as the IBM approach when using either periodicity features or spectral energy features alone. However, while periodicity features alone led to a strong performance degradation, spectral energy features were still reasonably accurate. Combining the two features improved the model performance so that it approached subject performance.
(3) Extracting binaural information from the target-dominated time-frequency bins during the early portions of the signal seems to be important for performing the task in reverberant environments. This is likely because reverberant energy is initially low and does not affect binaural information. However, neither of the template-matching features was capable of extracting enough of the critically important target-related information during the signal onset.
(4) The failure of the template-based BMs to extract the target-related information during the early signal portions indicates a more complex selection process, possibly involving temporal integration and across-frequency integration of correlative extracted features, which was not considered in this study. Alternatively, it is possible that the listeners combined ITDs and ILDs to estimate the target, an option not considered in the binaural model used in this study.
(5) Binaural and periodicity features were selected based on a salience measure with a rather strict criterion. It was then assumed that each selected feature belongs either to the target or to the background. This means that binaural unmasking as implemented, e.g., in equalization-cancellation models of binaural processing, was excluded. Still, the model performed as well as human listeners. This suggests that explicit modeling of target-masker superposition may not be needed for modeling human sound localization.

FIG. 1. Model outline. Part A: The left and right ear signals are first preprocessed by a peripheral model. After that, binaural and monaural features are extracted; the dashed box identifies the processing steps used in the binaural model adapted from Dietz et al. (2011); the extracted monaural features are periodicity and spectral energy, derived from the left and right channel signals individually. Part B: Based on the monaural features, a template-matching procedure is applied from which BMs are estimated. The selection of target-related binaural features is based on these BMs, or on the IBM (in which case the stages enclosed in the dash-dotted box are replaced by IBM extraction). Part C: The target location is estimated based on the binaural information selected as belonging to the target as well as on the masker information in all frequency channels.

FIG. 2. Median target localization bias from actual target location as a function of target location without maskers (simulation A, control condition). The dashed line, the gray filled area, and the thin gray lines indicate the median, the upper and lower quartiles, and the minimum and maximum of the subjects' individual median biases, respectively (data from Kopčo et al., 2010). The solid line represents the model results.

FIG. 3. Masked target localization modeled using the IBM as a selector for target-related bins (simulation B). Median biases (top row) and IQRs across runs (bottom row) are shown as a function of target position φ for subjects and model. Each panel shows the results for a specific spatial masker pattern p, indicated by the black triangles on the abscissa. Results were merged based on spatial masker pattern symmetry. Medians, lower and upper quartiles, and minimum and maximum of individual median biases and IQRs are shown for the subjects as dashed lines, filled areas, and thin gray lines, respectively (data from Kopčo et al., 2010). Black circles indicate the median biases and IQRs of the model. Open circles indicate that the local model RMSE was more than two standard deviations away from the mean local RMSE of the subjects.

FIG. 4. (Color online) Masked target localization biases (top row) and IQRs (bottom row) as a function of target position φ for the template-matching models (simulation C). The layout of the figure and the human data are identical to Fig. 3. Different colors and symbols represent the model variations using different monaural features (triangles: periodicity; circles: spectral energy; diamonds: combination of both monaural features). The open symbols indicate that the local model RMSE was more than two standard deviations away from the mean local RMSEs of the subjects. Values that fall outside the plot range are plotted along the plot edges and not connected to the other data points.

FIG. 5. Comparison of BMs, which serve as the basis for the selection of target-related binaural information, for one sample run (top left: IBM, bottom left: BM_P0, top right: BM_E, bottom right: BM_E,P0). Black areas identify the estimated target-dominant time-frequency bins ("glimpses"). The calculation of PPVs, NPVs, and ACCs (shown on the right, next to each panel) of the template-based BMs was done with reference to the IBM. Furthermore, the GPs are shown. Vertical dashed lines indicate the separation between early portions (<100 ms) and late portions (>100 ms) of the signal.

TABLE I. Overview of performed simulations.
condition with selection based on template matching using periodicity (BM_P0), spectral energy (BM_E), and a combination of both features (BM_E,P0)
D: Influence of early and late portions of the signal

TABLE II. Global model RMSEs for the bias and the IQR for the different model versions. RMSEs were calculated relative to the median subject data. Model data were obtained using different types of BMs and combinations of BMs to test the influence of selection of early vs late portions of the signal on localization performance. Z values identify how many standard deviations the global RMSEs of the model differed from the mean global RMSEs of the subjects.
IBM-based model can be used as a reference for evaluating models that do not use optimal a priori information for selection of target-related [t, f_c] bins. Investigating the local RMSEs, it was observed that for most (φ, p) conditions, the model did not significantly differ from the subject data (open circles in Fig. 3 indicate where it did).

TABLE III. PPVs, NPVs, and ACCs of the different template-matching BMs with respect to the IBM, and GPs for IBM and BMs. The table shows the measures for the whole BMs, and for the early and late portions individually.