A binaural sound source localization model based on time-delay compensation and interaural coherence

Binaural sound source localization is an important technique for speech capture and enhancement. However, the simple two-sensor array structure makes it hard to localize sources in complex noisy conditions. This paper presents a novel algorithm for binaural sound localization based on time-delay compensation (TDC) and interaural coherence. First, TDC of the binaural signals is used to estimate the interaural time delay (ITD) and the interaural intensity difference (IID), instead of generalized cross correlation and the logarithmic energy ratio. Then, interaural coherence is used to select reliable frames and reduce the variance of the ITDs. Finally, a hierarchical framework, which reduces computational complexity, is applied to decide the location based on Bayesian rule. The innovation is that both ITD and IID are derived from the same TDC step. Experiments show that, compared with other popular algorithms, the most prominent advantage of this method is its low complexity in both time and storage.


INTRODUCTION
Binaural sound source localization (SSL) is an essential and popular technique in many applications such as videoconferencing, smart rooms, and human-computer interaction. It mimics human auditory localization, which can pinpoint a sound source swiftly and accurately [1,2]. Two significant binaural (interaural) cues arise from the differences in arrival time and level of the sound at the two ears: interaural time differences (ITDs) and interaural intensity differences (IIDs) [3,4]. Over the last decades, a large number of binaural localization algorithms have been developed in various experimental environments.
Most traditional methods are based on ITD or IID and seldom consider their influence on each other [5][6][7][8][9]. Intuitively, because of the ITD, the signals received by the two ears have different starting points with respect to the sound source, which affects the extraction of IID. Willert et al. presented a two-dimensional frequency versus time-delay representation of binaural cues, the so-called activity maps [10], and this idea was improved in [11]. A hierarchical system was proposed by Li et al. to cut down the number of matching operations [7]. However, ITDs are usually calculated by the classical generalized cross correlation (GCC) [12] and IIDs are defined by the logarithmic energy ratio, which means that two independent processes are required to compute the binaural cues.
Accordingly, this paper proposes a novel time-delay compensation (TDC) algorithm that evaluates ITDs and IIDs with the same processor, which reduces the redundancy of the implementation. Interaural coherence is used to refine the time-delay estimate by choosing reliable frames, and a new GCC-TDC function is put forward to suppress ITD fluctuations. Then, a hierarchical framework based on Bayesian rule is adopted to reduce the computational complexity: the ITD is used in the first layer to select candidate azimuths, and the IID is then used to make the final decision.
Relation to prior work: This work focuses on an improved version of the TDC algorithm, and the localization process takes advantage of a hierarchical framework. Willert et al. proposed activity maps for binaural SSL, and we developed TDC to account for the relationship between ITDs and IIDs and to improve performance in noisy environments [11]. However, all these previous works compute the binaural cues in separate steps rather than in one holistic computation, and serial processing generally takes more time than parallel processing. The hierarchical system put forward by Li et al. reduces time complexity but increases space complexity, because more layers need more prior templates. In addition, experiments have verified that dividing the signal into frequency sub-bands helps localization little while increasing storage by an order of magnitude, because low-frequency signals such as speech easily diffract around the head [4,11,13].
The rest of this paper is organized as follows: the TDC and hierarchical algorithms are introduced in Sect. 2 and Sect. 3, respectively. Experiments and analysis are presented in Sect. 4. Finally, conclusions are drawn in Sect. 5.

Fig. 1: A brief illustration of the binaural localization framework. The left part is the model based on the interaural-polar coordinate system. The core of the right part is time-delay compensation, from which both ITD and IID can be solved.

Feature Extraction
Let s(n) denote a sound source signal, and let x_l(n) and x_r(n) denote the signals received at the two microphones or ears, respectively (see Fig. 1). To simplify the analysis, assume that the binaural signals are delayed and attenuated copies of the sound source, as expressed in Eq. (1), where a_l and a_r denote the attenuation factors, τ_l and τ_r are the propagation delays from the sound source to the two acoustic sensors, and v_l(n) and v_r(n) are the interferences, respectively. The interaural time delay Δτ is defined in Eq. (2). Taking the idea of time-delay compensation into account, the relationship between the binaural signals is given by Eq. (3), where W, λ and Δv denote the window function, the attenuation difference and the disparity of the noises received by the two ears, respectively. In fact, Δv is also the error of TDC, and the essential task is to make the compensated binaural signals as similar as possible.

From the standpoint of the noise, Eq. (3) can be rewritten as Eq. (4). In an office environment, Δv is usually regarded as zero-mean Gaussian noise, so its variance y can be defined as in Eq. (5). The parameters λ and Δτ can therefore be estimated by maximum likelihood estimation, Eq. (6). Setting the partial derivative with respect to λ to zero, λ, namely the interaural intensity difference (IID), is easily solved as in Eq. (7), where N denotes the length of the window.

The time delay Δτ is difficult to compute directly from ∂y/∂Δτ, so the problem is transformed into the frequency domain instead, and Eq. (5) is rewritten as Eq. (8), where Y(e^{jω}) and X(e^{jω}) are the Fourier transforms of the variance and of the windowed binaural signals. With the substitution of Eq. (9), ∂Y(e^{jω})/∂Δτ can be formulated as Eq. (10). Setting ∂Y(e^{jω})/∂Δτ to zero, and noting that jω and e^{−jωΔτ} are not equal to zero, yields Eq. (11), where * indicates the complex conjugate.
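For orientation, one set of expressions consistent with the definitions above can be sketched as follows. It is an illustration under stated assumptions rather than the paper's exact formulas: the sign convention Δτ = τ_l − τ_r, the identification λ = a_l/a_r, compensation applied to the right channel, and the omission of the window W are all assumed here.

```latex
\begin{align*}
x_l(n) &= a_l\, s(n-\tau_l) + v_l(n), \\
x_r(n) &= a_r\, s(n-\tau_r) + v_r(n), \\
\Delta\tau &= \tau_l - \tau_r, \qquad \lambda = a_l / a_r, \\
\Delta v(n) &= x_l(n) - \lambda\, x_r(n-\Delta\tau), \\
y(\lambda,\Delta\tau) &= \frac{1}{N}\sum_{n=0}^{N-1}
   \bigl[\,x_l(n) - \lambda\, x_r(n-\Delta\tau)\,\bigr]^2, \\
\frac{\partial y}{\partial \lambda} = 0
 \quad&\Longrightarrow\quad
 \lambda = \frac{\sum_{n=0}^{N-1} x_l(n)\, x_r(n-\Delta\tau)}
                {\sum_{n=0}^{N-1} x_r^{2}(n-\Delta\tau)} .
\end{align*}
```

Under these assumptions the residual equals v_l(n) − λ v_r(n − Δτ), which is why Δv is at once the disparity of the noises and the error of TDC.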
Transforming Eq. (11) back to the time domain with the inverse discrete Fourier transform gives Eq. (12), where R(n) is the proposed GCC-TDC function. It closely resembles the Roth weighting [14], which is based on an optimal filter with x_l(n) and x_r(n) as the input and reference signals [15,16], respectively. Δτ can then be estimated as in Eq. (13). Consequently, Δτ is the optimal time delay in the sense of the minimum mean square error criterion.
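A minimal numerical sketch of this estimator is given below. Because the text states only that R(n) resembles the Roth weighting, the sketch uses a plain Roth-weighted GCC as a stand-in for GCC-TDC. The function names `roth_gcc_delay`, `shift` and `ls_iid`, the zero-padding choice and the regularizer `eps` are illustrative assumptions. The gain is the least-squares ratio obtained by minimizing the mean square TDC error with respect to λ.

```python
import numpy as np

def roth_gcc_delay(x_l, x_r, max_lag, eps=1e-12):
    """Lag (in samples) of x_l relative to x_r from a Roth-weighted GCC.

    Stand-in for the GCC-TDC function R(n); not the authors' exact formula.
    """
    x_l = np.asarray(x_l, dtype=float)
    x_r = np.asarray(x_r, dtype=float)
    n_fft = 2 * max(len(x_l), len(x_r))   # zero-pad to avoid circular wrap-around
    X_l = np.fft.rfft(x_l, n_fft)
    X_r = np.fft.rfft(x_r, n_fft)
    # Roth weighting: cross-spectrum divided by the reference auto-spectrum.
    weighted = X_l * np.conj(X_r) / (np.abs(X_r) ** 2 + eps)
    r = np.fft.irfft(weighted, n_fft)
    # Reorder so that index 0 corresponds to lag -max_lag.
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    return int(np.argmax(r)) - max_lag

def shift(x, d):
    """Delay x by d samples (d may be negative), padding with zeros."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    if d > 0:
        y[d:] = x[:-d]
    elif d < 0:
        y[:d] = x[-d:]
    else:
        y[:] = x
    return y

def ls_iid(x_l, x_r, delay, eps=1e-12):
    """Least-squares gain lam such that x_l(n) ~= lam * x_r(n - delay)."""
    x_l = np.asarray(x_l, dtype=float)
    x_r_comp = shift(x_r, delay)          # time-delay compensation
    return float(np.dot(x_l, x_r_comp) / (np.dot(x_r_comp, x_r_comp) + eps))
```

For an exactly delayed and scaled copy x_l(n) = λ x_r(n − D), the weighted spectrum reduces to λ e^{−jωD}, so the peak of R(n) sits at lag D and its height equals λ.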

Interaural coherence
Based on the above analysis, ITDs and IIDs can both be extracted from TDC. Combining Eqs. (7) and (12) shows that, although ITD and IID are mutually related, λ in fact affects only the height of R(n). In contrast, λ relies heavily on the time delay, so stable ITDs should be calculated first. To this end, interaural coherence (IC) is introduced into GCC-TDC [17,18]. The energies of the left-ear and right-ear signals are evaluated by recursive averaging as in Eq. (14), where κ is the frame index and each frame lasts 5.8 ms. The smoothing factor α is determined from the time constant T and the sampling frequency f_s as α = 1/(T · f_s) [19].
The IC function γ(κ, ω) is defined in Eq. (15), where E_lr(κ, ω) is the cross-energy spectrum calculated by Eq. (16). In the following, only cues from frames whose coherence γ(κ, ω), accumulated over ω, exceeds the empirical threshold γ_0 are considered meaningful; otherwise the frame is regarded as unreliable and discarded. As a result, the proposed GCC-TDC is modified with γ(κ, ω) as in Eq. (17).

Fig. 2 compares the performance of the proposed GCC-TDC with that of the typical GCC-PHAT. Both GCC-PHAT and GCC-TDC achieve relatively accurate ITDs, yet the variance obtained by GCC-TDC is smaller, because GCC-TDC is fundamentally designed to minimize variance, which yields more stable ITDs.
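The frame-selection step can be sketched as follows. The recursion E(κ) = α·(current term) + (1 − α)·E(κ − 1) and the normalized form γ = |E_lr| / sqrt(E_ll · E_rr) are assumed here as the usual way to write recursive averaging and interaural coherence. Averaging γ over frequency before thresholding, and the names `interaural_coherence` and `reliable_frames`, are likewise assumptions made for illustration.

```python
import numpy as np

def interaural_coherence(X_l, X_r, alpha, eps=1e-12):
    """Coherence gamma(kappa, omega) from recursively averaged energies.

    X_l, X_r: complex STFT arrays of shape (n_frames, n_bins).
    alpha:    smoothing factor, e.g. 1 / (T * f_s) as stated in the text.
    """
    n_frames, n_bins = X_l.shape
    E_ll = np.zeros(n_bins)
    E_rr = np.zeros(n_bins)
    E_lr = np.zeros(n_bins, dtype=complex)
    gamma = np.zeros((n_frames, n_bins))
    for k in range(n_frames):
        # Recursive averages of the auto- and cross-energy spectra.
        E_ll = alpha * np.abs(X_l[k]) ** 2 + (1.0 - alpha) * E_ll
        E_rr = alpha * np.abs(X_r[k]) ** 2 + (1.0 - alpha) * E_rr
        E_lr = alpha * X_l[k] * np.conj(X_r[k]) + (1.0 - alpha) * E_lr
        gamma[k] = np.abs(E_lr) / np.sqrt(E_ll * E_rr + eps)
    return gamma

def reliable_frames(gamma, gamma_0):
    """Boolean mask of frames whose mean coherence exceeds gamma_0 (assumed rule)."""
    return gamma.mean(axis=1) > gamma_0
```

Note that the first frame always yields a coherence close to 1, since a single frame is trivially coherent with itself; in practice a few warm-up frames would be skipped before thresholding.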

SOUND SOURCE LOCALIZATION
The task of sound source localization is to obtain the azimuth θ and the elevation ϕ; that is, the ITD and IID need to be converted into angles. Considering the geometrical relation in Fig. 1, Eq. (18) is obtained, where d is the distance between the two microphones, c is the speed of sound in air (344 m/s), and f_s is the sampling frequency.

For SSL, a hierarchical localization framework is used. First, the mean time delay τ_i and the corresponding standard deviation σ_i are trained for each azimuth θ_i. Since each time delay matches one and only one θ_i, the probability of θ_i, denoted P(θ_i | Δτ), can also be trained before localization. When a new sound source arrives, the central azimuth is resolved and an admissible interval is obtained as in Eq. (19). Then, following the same line of thought for the intensity difference λ, the average IID μ_j and the standard deviation δ_j are trained for every direction. Based on the candidate azimuths from the previous stage (see Fig. 1), the probability of the elevation ϕ_j and the admissible interval of λ are obtained as in Eq. (20).
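The two-layer decision can be sketched as below. This is a hedged illustration, not the paper's exact rule: Gaussian templates, a uniform prior over azimuths, a k-sigma admissible interval, and the far-field relation sin θ = c·Δτ / (d·f_s) with Δτ in samples are all assumptions, as are the function names.

```python
import numpy as np

def itd_to_azimuth(delay_samples, d, fs, c=344.0):
    """Far-field azimuth (radians) from a delay in samples (assumed geometry)."""
    return float(np.arcsin(np.clip(c * delay_samples / (d * fs), -1.0, 1.0)))

def gaussian_pdf(x, mean, std):
    """Gaussian density, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def localize(itd, iid, itd_mean, itd_std, iid_mean, iid_std, k=3.0):
    """Hierarchical decision: ITD selects azimuths, IID picks the elevation.

    itd_mean, itd_std: trained templates per azimuth, shape (n_a,).
    iid_mean, iid_std: trained templates per direction, shape (n_a, n_e).
    Returns (azimuth index, elevation index), or None if no azimuth
    template lies within k standard deviations of the observed ITD.
    """
    itd_mean = np.asarray(itd_mean, dtype=float)
    itd_std = np.asarray(itd_std, dtype=float)
    iid_mean = np.asarray(iid_mean, dtype=float)
    iid_std = np.asarray(iid_std, dtype=float)
    # Layer 1: keep azimuths whose admissible ITD interval contains the observation.
    cand = np.flatnonzero(np.abs(itd - itd_mean) <= k * itd_std)
    if cand.size == 0:
        return None
    # Posterior over the candidate azimuths under a uniform prior.
    p_az = gaussian_pdf(itd, itd_mean[cand], itd_std[cand])
    p_az = p_az / p_az.sum()
    # Layer 2: IID likelihood for every elevation of each candidate azimuth.
    p_el = gaussian_pdf(iid, iid_mean[cand], iid_std[cand])
    joint = p_az[:, None] * p_el
    i, j = np.unravel_index(np.argmax(joint), joint.shape)
    return int(cand[i]), int(j)
```

Only the IID templates of the surviving azimuths are evaluated in the second layer, which is where the reduction in matching operations comes from.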

EXPERIMENTS AND DISCUSSIONS
To evaluate our method, experiments use the CIPIC database, measured by the U.C. Davis CIPIC Interface Laboratory, which includes head-related impulse responses (HRIRs) for 45 different subjects [20]. The parameters used here are shown in Table 1. The method in this paper is abbreviated as ICTDC, and the three algorithms compared against are TDC [11], Hierarchical System (HS) [7] and Probability Model (PM) [10]. The experimental sound sources are captured in an office environment at different signal-to-noise ratios (SNRs). The results for θ are listed in Table 2. In a quiet natural environment (40 dB), all four methods achieve a high accuracy of up to 90% with little disparity, but when the SNR is 10 dB, ICTDC performs best, increasing azimuthal accuracy by nearly 10%, which is mainly owing to GCC-TDC obtaining more stable ITDs. For the elevation ϕ, a more obvious advantage is reflected in Table 3. HS lags seriously behind the others, because ICTDC, TDC and PM are algorithms of the same type that consider the influence of ITD on IID. Besides, ICTDC incorporates the interaural coherence function into time-delay estimation, which makes the ITDs more robust by relying on reliable frames even in noisy environments.
The algorithm complexity is shown in Fig. 3, from which it can be seen that the proposed method has the lowest complexity. Fig. 3(a) reports the time consumption of the four methods.
• ICTDC calculates ITDs and IIDs all at once, which reduces the number of steps needed to evaluate the binaural cues.
• The search space is reduced to O(n_a n_e) (see Fig. 3(b)), where n_a, n_e and n_c denote the numbers of azimuths, elevations and frequency sub-bands, respectively, because ICTDC only needs to store the ITDs and IIDs for the n_a n_e directions referred to in Algorithm 1. This follows from the fact that dividing the signal into frequency sub-bands brings little benefit to TDC.
• The matching strategy of the hierarchical framework also effectively reduces the number of candidate directions.
Therefore, compared with the others, ICTDC is more practical for SSL systems, especially for real-time sound source tracking.

CONCLUSIONS
In this paper, a novel binaural sound localization approach based on time-delay compensation (TDC) and interaural coherence is presented. The approach not only improves the localization accuracy to some degree but, most importantly, decreases the complexity in both storage and time, with the time consumption reduced to 0.2 s. TDC exploits the influence of ITD on IID to extract the binaural cues, both of which are calculated by the same processor. Interaural coherence is applied to the time-delay estimate, which reduces the variance of the ITDs. The final localization is achieved by a hierarchical system using Bayesian rule, and searching layer by layer effectively reduces the number of matching operations. Overall, our algorithm is more suitable for practical localization systems.