Study of Widely Linear Multichannel Wiener Filter for Binaural Noise Reduction

In this paper, we study the binaural noise-reduction problem using an array of microphones. The widely linear (WL) framework in the short-time-Fourier-transform (STFT) domain is adopted. In this framework, the microphone array signals and binaural outputs are first merged into complex signals, which are subsequently transformed into the STFT domain. The WL estimation theory is then applied in STFT subbands with interband correlation to form the optimal WL Wiener filter, which exploits the noncircular properties of the input complex signals to achieve noise reduction while preserving the sound spatial realism. Finally, the time-domain binaural output is reconstructed from the output of the WL Wiener filter using the inverse STFT. The effectiveness of the developed STFT-domain WL Wiener filter for binaural noise reduction is verified through experiments.


I. INTRODUCTION
Binaural noise reduction is an important problem in many applications, e.g., hearing aids, virtual/augmented reality, 3D gaming, and teleconferencing. It has received tremendous research interest over the last few decades [1]-[10]. Unlike the widely studied subject of monaural noise reduction, which aims only at reducing noise, the objective of binaural noise reduction consists of two aspects: noise reduction (to improve either speech quality or intelligibility [11]) and preservation of sound spatial information. To achieve this objective, a binaural noise-reduction system generally takes multichannel (at least two) inputs from an array of microphones and produces two-channel outputs.
A straightforward way of achieving binaural noise reduction is to use monaural noise-reduction techniques to produce two outputs while applying constraints between the two outputs to preserve the so-called sound spatial cues [3]-[5]. However, this method requires good estimates of the spatial cues, and the preservation process is in general not optimal. Recently, a widely linear (WL) filtering approach was developed to achieve binaural noise reduction using two microphones [6], [7]. It works in the complex domain by combining both the stereo input and expected binaural output signals into complex signals. Through this, the binaural noise-reduction problem is transformed into one of single-channel noise reduction under the WL filtering framework. More recently, this principle was extended to the case of multiple microphones [8]. The WL filtering approach has proven to be effective for binaural noise reduction. However, the time-domain formulation and processing developed in [6]-[8] is in general computationally very expensive. To make the implementation more efficient, the time-domain framework was extended to the short-time-Fourier-transform (STFT) domain in [10], where coefficients from different STFT subbands are assumed to be uncorrelated.
This paper is also concerned with the binaural noise-reduction problem in the STFT domain. In contrast with the previous work reported in [10], the contribution of our paper lies in the following two aspects. First, we show that with the WL model in the STFT domain, there exists a relationship between certain subbands. Second, a WL Wiener filter is developed that takes into account the relationship between different subbands to achieve binaural noise reduction. We will show how to derive the optimal WL Wiener filter when the interband relationship is taken into account.
The performance of the developed STFT-domain WL Wiener filter is verified using experiments and comparison is made to show the advantage of the WL Wiener filter in this paper over its counterpart in [10].

II. PROBLEM FORMULATION
The signal model adopted in this paper is the same as the one used in [8]. Let us consider the scenario where a sound source radiates a signal of interest in a reverberant and noisy acoustic environment. We use a microphone array (with $2M$ sensors) to capture the signal. Then, the output of each microphone is written as
$$y_{r,m}(t) = g_{r,m}(t) * s(t) + v_{r,m}(t) = x_{r,m}(t) + v_{r,m}(t), \quad m = 1, 2, \ldots, 2M, \tag{1}$$
where $s(t)$ is the unknown sound source signal, $*$ denotes linear convolution, $g_{r,m}(t)$ denotes the room impulse response from $s(t)$ to the $m$th microphone, and $x_{r,m}(t) = s(t) * g_{r,m}(t)$ and $v_{r,m}(t)$ are the convolved speech and additive noise, respectively, captured by the $m$th microphone. All the signals $x_{r,m}(t)$ and $v_{r,m}(t)$ are assumed to be real, broadband, and zero mean. Furthermore, it is assumed that the signals $x_{r,m}(t)$ are uncorrelated with $v_{r,m}(t)$. By definition, $x_{r,m}(t)$ are assumed to be coherent across the array, while $v_{r,m}(t)$ may be either partially coherent or incoherent across the array. To achieve binaural noise reduction, we need to simultaneously recover the speech signals at two of the $2M$ microphones. Without loss of generality, we choose to recover $x_{r,1}(t)$ and $x_{r,M+1}(t)$. Following the principle in [6], [8], we choose to work in the complex domain by merging the real array outputs into complex signals so that the original problem is converted to one of multiple-input-single-output noise reduction. With the real signal model given in (1), the complex signals used in this paper are formed as
$$y_i(t) = y_{r,i}(t) + j y_{r,M+i}(t) = g_i(t) * s(t) + v_i(t) = x_i(t) + v_i(t), \quad i = 1, 2, \ldots, M, \tag{2}$$
where $j = \sqrt{-1}$ denotes the imaginary unit, $g_i(t) = g_{r,i}(t) + j g_{r,M+i}(t)$ is the complex acoustic impulse response for the $i$th complex channel, $x_i(t) = x_{r,i}(t) + j x_{r,M+i}(t)$ is the complex clean signal, and $v_i(t) = v_{r,i}(t) + j v_{r,M+i}(t)$ is the complex additive noise. With the above complex signal model, the binaural noise-reduction problem can now be restated as: minimizing the effect of the noise term, $v_i(t)$, thereby recovering the complex signal $x_1(t)$, including the spatial information embedded in it.
As demonstrated in [6], [7], all the signals $y_i(t)$ are noncircular complex random variables (CRVs). Therefore, the WL filtering theory is needed in order to recover $x_1(t)$ from the $M$ complex noisy signals $y_i(t)$.
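The merging step and the noncircularity claim can be illustrated with a minimal NumPy sketch. The helper names `merge_to_complex` and `circularity_quotient`, the row-per-channel array layout, and the synthetic gains and noise level are all assumptions made for illustration; they do not come from [6]-[8]. The sketch pairs real channel $i$ with real channel $M+i$ and estimates the ratio $E[z^2]/E[|z|^2]$ from samples.

```python
import numpy as np

def merge_to_complex(y_real):
    """Merge 2M real channels (rows) into M complex ones: y_i = y_{r,i} + j*y_{r,M+i}."""
    two_m = y_real.shape[0]
    if two_m % 2:
        raise ValueError("expected an even number of real channels")
    m = two_m // 2
    return y_real[:m] + 1j * y_real[m:]

def circularity_quotient(z):
    """Sample estimate of E[z^2] / E[|z|^2] for a zero-mean complex sequence."""
    return np.mean(z * z) / np.mean(np.abs(z) ** 2)

rng = np.random.default_rng(0)
s = rng.standard_normal(20000)            # stand-in for the source signal (assumption)
gains = np.array([1.0, 0.8, 0.6, 0.5])    # hypothetical gains replacing room responses
# 2M = 4 real channels: scaled copies of s plus independent white noise
y_real = gains[:, None] * s + 0.1 * rng.standard_normal((4, s.size))

y = merge_to_complex(y_real)              # shape (M, samples) = (2, 20000)
gamma = circularity_quotient(y[0])
print(abs(gamma))                         # a magnitude well above 0 indicates noncircularity
```

Because the real and imaginary parts of each merged channel share the same source, the magnitude of the estimated quotient is far from zero, which is the property the WL framework exploits.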
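In the STFT domain, we can rewrite (2) as
$$Y_i(k,n) = X_i(k,n) + V_i(k,n), \tag{3}$$
where $Y_i(k,n)$, $X_i(k,n)$, and $V_i(k,n)$ are the STFTs of $y_i(t)$, $x_i(t)$, and $v_i(t)$, respectively, at frequency-bin $k$ (with $k = 0, 1, \ldots, K-1$ and $K$ being the total number of frequency bins) and time-frame $n$. Putting $Y_i(k,n)$, $i = 1, 2, \ldots, M$, into vector notation, we get
$$\mathbf{y}(k,n) = [Y_1(k,n) \;\; Y_2(k,n) \;\; \cdots \;\; Y_M(k,n)]^T = \mathbf{x}(k,n) + \mathbf{v}(k,n),$$
where the superscript $T$ stands for the transpose operator, and $\mathbf{x}(k,n)$ and $\mathbf{v}(k,n)$ are defined analogously to $\mathbf{y}(k,n)$.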

III. CORRELATION BETWEEN DIFFERENT STFT SUBBANDS
In monaural noise reduction in the STFT domain, coefficients from different STFT subbands are assumed to be uncorrelated, either implicitly or explicitly, and noise reduction is typically performed independently in each subband. This is generally true for real signals if the length of the fast Fourier transform (FFT) is sufficiently large. The same assumption was adopted in [10] for binaural noise reduction in the STFT domain with the WL framework. However, with the signal model given in (3), there exists a certain relationship between the STFT coefficients at the $k$th and $(K-k)$th subbands [12], [13]. As a matter of fact, it can be checked from (3) that where the superscript $*$ stands for complex conjugation, and $G_i(k)$ is the STFT coefficient of $g_i(t)$. Therefore, the coefficients from both the $k$th and $(K-k)$th subbands should be considered together in order to recover the clean speech at the $k$th subband. To explore this relationship, let us define the following signal vector: where $\mathbf{x}(k,n)$ and $\mathbf{v}(k,n)$ are defined analogously to $\mathbf{y}(k,n)$. It follows then that where Combining (6), (7), and (8), we obtain where .
From the signal model (11), one can see that the binaural noise-reduction problem becomes one of estimating $X_1(k,n)$ from the complex signal vector $\mathbf{y}(k,n)$.
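The link between the $k$th and $(K-k)$th subbands can be checked numerically on a single frame. The sketch below is an illustration under simple assumptions (one DFT frame of length $K = 64$, white synthetic data, variable names chosen here); it is not the derivation used in the paper. It relies only on the conjugate symmetry of the DFT of a real sequence: if $x = a + jb$ with $a$ and $b$ real, then $X(k) = A(k) + jB(k)$ and $X^*(K-k) = A(k) - jB(k)$.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 64                                    # frame (DFT) length, an arbitrary choice
a = rng.standard_normal(K)                # real part: frame of one real channel
b = rng.standard_normal(K)                # imaginary part: frame of the paired channel

X = np.fft.fft(a + 1j * b)                # spectrum of the merged complex frame
A = np.fft.fft(a)
B = np.fft.fft(b)

k = np.arange(K)
mirror = (K - k) % K                      # index of bin K-k (bin 0 maps to itself)
X_mirror_conj = np.conj(X[mirror])        # X*(K-k), equal to A(k) - j*B(k)

left = 0.5 * (X + X_mirror_conj)          # recovers A(k)
right = (X - X_mirror_conj) / 2j          # recovers B(k)
```

The same identity holds frame by frame in an STFT with a real analysis window, so bins $k$ and $K-k$ of the complex signal jointly carry the spectra of the two underlying real channels.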

IV. STFT-DOMAIN WIDELY LINEAR FILTERING FOR BINAURAL NOISE REDUCTION
The estimation of $X_1(k,n)$ from the complex signal vector $\mathbf{y}(k,n)$ can be accomplished using the WL estimation theory [14]-[16] as where the superscript $H$ denotes the conjugate-transpose operator, $\mathbf{h}(k,n)$ and $\mathbf{h}'(k,n)$ are two complex finite-impulse-response (FIR) filters, both of length $2M$, is a vector of length $4M$, referred to as the augmented WL filter, is the augmented noisy signal vector, also of length $4M$, and $\mathbf{x}(k,n)$ and $\mathbf{v}(k,n)$ are defined analogously to $\mathbf{y}(k,n)$.
If we set $\mathbf{h}'(k,n) = \mathbf{0}_{2M}$ (where $\mathbf{0}_{2M}$ is a $2M \times 1$ vector consisting of all zero elements) for any $k$ and $n$, (13) degenerates to the classical linear filtering framework [17], [18]; however, this classical filtering process is not optimal for noncircular signals [14].
From (13), one can see that the estimate of $X_1(k,n)$ depends on the signal vector $\mathbf{x}(k,n)$; but the desired signal at frequency-bin $k$ and time-frame $n$ is $X_1(k,n)$ instead of the whole vector $\mathbf{x}(k,n)$. To see how each element in $\mathbf{x}(k,n)$ contributes to the estimate of $X_1(k,n)$, let us first decompose $X_1^*(k,n)$ as where is the second-order circularity quotient [19] of $X_1(k,n)$, and $E[\cdot]$ denotes the mathematical expectation. If $\gamma_{X_1}(k,n) = 0$, $X_1(k,n)$ is second-order circular; otherwise, $X_1(k,n)$ is noncircular. The absolute value of $\gamma_{X_1}(k,n)$, which is between 0 and 1, quantifies the degree of noncircularity of $X_1(k,n)$; a larger value of $|\gamma_{X_1}(k,n)|$ indicates that $X_1(k,n)$ is more noncircular. From (16), it can be checked that Using (16), we can write $\mathbf{x}(k,n)$ as where $\mathbf{x}'(k,n)$ Now, substituting (20) into (13), we get where n) is called the residual interference, and $V_{\mathrm{rn}}(k,n) = \mathbf{h}^H(k,n)\mathbf{v}(k,n)$ is called the residual noise.
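A small numerical sketch may help make the circularity quotient concrete. It assumes the commonly used definition $\gamma_X = E[X^2]/E[|X|^2]$ and the commonly used decomposition $X^* = \gamma_X^* X + X'$, in which the remainder $X'$ is uncorrelated with $X$; the exact form of (16) may differ. The synthetic variable and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000
# Synthetic noncircular variable: real and imaginary parts are correlated (assumption)
u = rng.standard_normal(n)
X = u + 1j * (0.7 * u + 0.3 * rng.standard_normal(n))

phi = np.mean(np.abs(X) ** 2)             # sample variance E[|X|^2]
gamma = np.mean(X ** 2) / phi             # sample circularity quotient

# Assumed decomposition: X* = conj(gamma) * X + X_prime
X_prime = np.conj(X) - np.conj(gamma) * X

cross = np.mean(X_prime * np.conj(X))     # sample E[X' X*], zero up to rounding
print(abs(gamma), abs(cross))
```

With this choice of coefficient, the part of $X^*$ that is coherent with $X$ is removed exactly, so the sample cross-correlation vanishes up to rounding error, while $|\gamma_X|$ stays between 0 and 1.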
One can verify that the two vectors $\mathbf{y}(k,n)$ and $\mathbf{y}(K-k,n)$ satisfy the following relation: is the anti-diagonal matrix which has the properties $\mathbf{P}^T = \mathbf{P}$ and $\mathbf{P}^2 = \mathbf{I}_{4M}$, and $\mathbf{I}_M$ denotes the identity matrix of size $M \times M$. Therefore, $\mathbf{y}(K-k,n)$ is simply a permutation of $\mathbf{y}(k,n)$. It follows then that where $\mathbf{\Phi}_{\mathbf{y}}(k,n) = E[\mathbf{y}(k,n)\mathbf{y}^H(k,n)]$ is the covariance matrix of the noisy signal vector. The above relationship can be used to reduce the complexity of the WL noise-reduction filter, as will become clear in the next section.
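The permutation property can be sketched numerically. The stacking order used below, $[\mathbf{y}(k,n),\ \mathbf{y}^*(K-k,n),\ \mathbf{y}^*(k,n),\ \mathbf{y}(K-k,n)]$, is an assumption chosen so that a block anti-diagonal matrix with $\mathbf{I}_M$ blocks maps the augmented vector at bin $k$ to the one at bin $K-k$; the paper's own ordering may differ. The helper `augmented` and the random test vectors are hypothetical.

```python
import numpy as np

M = 2
# Block anti-diagonal matrix with I_M blocks: 4 x 4 blocks, overall size 4M x 4M
P = np.kron(np.fliplr(np.eye(4)), np.eye(M))

rng = np.random.default_rng(3)
y_k = rng.standard_normal(M) + 1j * rng.standard_normal(M)    # stand-in for y(k, n)
y_mk = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # stand-in for y(K-k, n)

def augmented(y_a, y_b):
    """Assumed stacking: [y_a, conj(y_b), conj(y_a), y_b] (length 4M)."""
    return np.concatenate([y_a, np.conj(y_b), np.conj(y_a), y_b])

aug_k = augmented(y_k, y_mk)              # augmented vector at bin k
aug_mk = augmented(y_mk, y_k)             # augmented vector at bin K-k

# One-sample covariance estimates, only to illustrate the implied relation
phi_k = np.outer(aug_k, aug_k.conj())
phi_mk = np.outer(aug_mk, aug_mk.conj())
```

Under this assumed ordering, `P @ aug_k` equals `aug_mk`, and the covariance at bin $K-k$ is `P @ phi_k @ P`, so statistics estimated at bin $k$ can be reused at bin $K-k$ without a second estimation pass.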

V. WIDELY LINEAR WIENER FILTER
Before deriving the optimal STFT-domain WL Wiener filter, let us first define the subband mean-square error (MSE) between the estimated and clean signals at frequency-bin $k$ and time-frame $n$: The WL Wiener filter is derived by taking the gradient of the subband MSE, $J(k,n)$, with respect to $\mathbf{h}^H(k,n)$ and setting the result to zero. The solution is are the covariance matrices of $\mathbf{x}(k,n)$ and $\mathbf{v}(k,n)$, respectively. According to (21), we have where $\phi_{X_1}(k,n) = E[|X_1(k,n)|^2]$ is the variance of $X_1(k,n)$. So, we can also write the WL Wiener filter as where $\phi_{Y_1}(k,n)$ and $\phi_{V_1}(k,n)$ are, respectively, the variances of $Y_1(k,n)$ and $V_1(k,n)$, and $\mathbf{d}_{Y_1}(k,n)$ and $\mathbf{d}_{V_1}(k,n)$ are defined analogously to $\mathbf{d}_{X_1}(k,n)$ in (21). With the derived WL Wiener filter, the resulting signal estimate at $(k,n)$ is n) y(k, n). Now using the relationship in (27), the estimate of $X_1(K-k,n)$ can be obtained as n) y(k, n), (33) where $\mathbf{i}_{4M,3M+1}$ is the $(3M+1)$th column of $\mathbf{I}_{4M}$. Inspecting (32) and (33), one can see that we only need to estimate the WL Wiener filter for half of the total STFT subbands, which is similar to the case of monaural noise reduction with real input signals.
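To illustrate why augmenting the observation with its conjugate helps for noncircular signals, the following sketch compares a strictly linear Wiener solution with a widely linear one on synthetic data. It is a generic example and not the filter derived in this section: the data model, the gain vector `steer`, and the helper `wiener` are hypothetical, and the cross-correlation vector is computed from the clean target (an oracle shortcut), whereas the paper expresses the filter through the statistics of the noisy and noise signals.

```python
import numpy as np

rng = np.random.default_rng(4)
n, L = 20000, 3                            # number of frames and vector length (arbitrary)
u = rng.standard_normal(n)
x1 = u + 1j * 0.8 * u                      # strongly noncircular desired signal (assumption)
steer = np.array([1.0, 0.7 - 0.2j, 0.4 + 0.5j])   # hypothetical channel gains
noise = (rng.standard_normal((L, n)) + 1j * rng.standard_normal((L, n))) / np.sqrt(2)
y = steer[:, None] * x1 + noise            # noisy observations, one column per frame

def wiener(obs, target):
    """Sample Wiener solution h = Phi^{-1} E[obs * conj(target)]; returns h^H obs."""
    phi = obs @ obs.conj().T / obs.shape[1]
    p = obs @ target.conj() / obs.shape[1]
    h = np.linalg.solve(phi, p)
    return h.conj() @ obs

est_linear = wiener(y, x1)                         # strictly linear filter
est_wl = wiener(np.vstack([y, y.conj()]), x1)      # widely linear: augmented input

mse_linear = np.mean(np.abs(est_linear - x1) ** 2)
mse_wl = np.mean(np.abs(est_wl - x1) ** 2)
print(mse_linear, mse_wl)
```

Since the strictly linear filter is the special case of the widely linear one with the conjugate branch set to zero, the widely linear sample MSE cannot exceed the linear one, and it is clearly lower here because the target is noncircular.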

VI. EXPERIMENTS
Now, we briefly evaluate the performance of the developed STFT-domain WL Wiener filter using experiments. For comparison, the filter developed in [10] is also evaluated. The experiments are configured using the room impulse responses measured at Bell Labs Varechoic Chamber [20], [21]. We consider a moderate reverberation condition with a reverberation time $T_{60}$ of approximately 0.24 s. An equispaced linear microphone array with 8 omnidirectional microphones is configured: the first sensor is located at the position (3.037, 0.500, 1.400) (in meters) and the last sensor is placed at (3.737, 0.500, 1.400), with a spacing of 0.1 m. To simulate a moving source, we play back some speech signals from the TIMIT database [22] and change the position of the source every 4 seconds among positions (1.337:1.000:4.337, 1.938, 1.600) (back and forth). The microphone signals are generated by convolving the source signal with the corresponding impulse responses, and white Gaussian noise is then added to the convolution results to set the input signal-to-noise ratio (SNR) to 5 dB. All the signals are resampled from the original sampling rate to 8 kHz. Note that in this paper we put aside the influence of noise estimation on performance and compute the covariance matrices directly from the noisy and noise signals using a recursive method with the two forgetting factors $\lambda_y = \lambda_v$ [23].
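Two steps of this setup lend themselves to a short sketch: scaling the added noise to a target input SNR and updating a covariance matrix recursively with a forgetting factor. The recursion below is the common exponentially weighted form $\mathbf{\Phi}(n) = \lambda \mathbf{\Phi}(n-1) + (1-\lambda)\mathbf{y}(n)\mathbf{y}^H(n)$, which is an assumption, since the exact method of [23] is not restated here. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def scale_noise_to_snr(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) equals snr_db."""
    p_clean = np.mean(np.abs(clean) ** 2)
    p_noise = np.mean(np.abs(noise) ** 2)
    return noise * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))

def recursive_covariance(frames, lam):
    """Assumed update Phi(n) = lam*Phi(n-1) + (1-lam)*y(n)y(n)^H; frames is (L, N)."""
    L, N = frames.shape
    phi = np.zeros((L, L), dtype=complex)
    history = np.empty((N, L, L), dtype=complex)
    for idx in range(N):
        y = frames[:, idx]
        phi = lam * phi + (1.0 - lam) * np.outer(y, y.conj())
        history[idx] = phi
    return history

rng = np.random.default_rng(5)
clean = rng.standard_normal(8000)          # stand-in for one second of speech at 8 kHz
noise = scale_noise_to_snr(clean, rng.standard_normal(8000), snr_db=5.0)
snr = 10 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

frames = rng.standard_normal((2, 500)) + 1j * rng.standard_normal((2, 500))
phi_hist = recursive_covariance(frames, lam=0.92)
```

A forgetting factor closer to 1 averages over more frames, which gives smoother estimates but slower tracking of a moving source; this trade-off is one reason performance is reported as a function of $\lambda_y$.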
Both the fullband output SNR and the speech distortion index [6] of the developed WL Wiener filter and the filter in [10] are plotted in Fig. 1. We observe that both filters improve the output SNR considerably, but with some distortion being added to the speech. Comparatively, the WL Wiener filter developed in this paper yields better performance, i.e., a higher output SNR and a smaller speech distortion index, when the forgetting factors are properly chosen. It is interesting to note that the developed WL filter requires only half the number of microphones to obtain performance similar to that achieved with the method in [10].
Fig. 2 plots the perceptual evaluation of speech quality (PESQ) [24] scores of both the developed WL Wiener filter and the filter in [10] as a function of the forgetting factor, $\lambda_y$. Since the PESQ standard does not support complex signals, we take the left- and right-channel outputs from the enhanced complex speech signals and compute the PESQ scores separately. It can be observed from Fig. 2 that the PESQ score first increases with $\lambda_y$ and then decreases. Comparatively, the WL Wiener filter developed in this work achieves a higher PESQ score than the method in [10]. Based on the results in Fig. 2, Table I gives the difference between the maximum PESQ scores achieved with the two WL Wiener filters with properly chosen forgetting factors.
To visualize the preservation of the sound spatial information, we computed the cross-correlation function (CCF) between the signals at the two output channels (estimating the signal of interest from the first and 5th microphones) every 128 ms. The CCFs are computed using a short-time average method as in [6]. The contours of the time-varying CCFs of the clean, noisy, and two enhanced signals are plotted in Fig. 3, where 8 microphones are used, i.e., $M = 4$, and the value of the forgetting factor for the method in [10] is 0.89 and that of the developed Wiener filter is 0.92 (the values are chosen according to the maximum output SNR that the respective filter can achieve in the previous simulation). In Fig. 3, the maximal value of the CCF at each time can be seen as the current position of the moving speech source. In the presence of noise, one can note that the sound spatial effect is dramatically altered. From the third and bottom traces in Fig. 3, one can see that both the method in [10] and the developed WL Wiener filter recover the sound spatial information very well.
To quantitatively compare the performance of the developed WL Wiener filter and the filter in [10] in terms of noise reduction and spatial information preservation, we compute both the output SNR and the Euclidean distance between the clean speech CCF (the CCF between the clean speech at the first microphone and that at the 5th microphone) and that of the enhanced signals. With our experimental setup, the output SNR of the WL Wiener filter developed in this paper is 17.77 dB while that of the method in [10] is 17.16 dB. The distance between the clean CCF and the CCF of the enhanced signals by the WL Wiener filter developed in this paper is 9.13 while that by the method in [10] is 11.88. These results clearly indicate that the developed STFT-domain WL Wiener filter outperforms the method in [10].
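The CCF-based comparison can be sketched as follows. This is an illustrative implementation under stated assumptions: non-overlapping 128 ms frames at 8 kHz, a normalized cross-correlation over a small lag range, and the Frobenius norm of the difference between two time-lag maps as the Euclidean distance. The short-time averaging of [6] and the exact distance used in the paper may differ, and the function names and white-noise test signals are hypothetical.

```python
import numpy as np

def short_time_ccf(left, right, frame_len, max_lag):
    """Normalized cross-correlation per non-overlapping frame, lags -max_lag..max_lag."""
    n_frames = len(left) // frame_len
    lags = np.arange(-max_lag, max_lag + 1)
    ccf = np.zeros((n_frames, lags.size))
    mid = frame_len - 1                     # index of lag 0 in the 'full' output
    for f in range(n_frames):
        a = left[f * frame_len:(f + 1) * frame_len]
        b = right[f * frame_len:(f + 1) * frame_len]
        norm = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)) + 1e-12
        full = np.correlate(a, b, mode="full")
        ccf[f] = full[mid - max_lag:mid + max_lag + 1] / norm
    return lags, ccf

def ccf_distance(ccf_ref, ccf_test):
    """Euclidean (Frobenius) distance between two time-lag CCF maps."""
    return float(np.linalg.norm(ccf_ref - ccf_test))

rng = np.random.default_rng(6)
fs = 8000
frame_len = 128 * fs // 1000                # 128 ms at 8 kHz = 1024 samples
left = rng.standard_normal(4 * frame_len)
right = np.roll(left, 3)                    # right channel delayed by 3 samples
lags, ccf_clean = short_time_ccf(left, right, frame_len, max_lag=10)

noisy_l = left + rng.standard_normal(left.size)
noisy_r = right + rng.standard_normal(right.size)
_, ccf_noisy = short_time_ccf(noisy_l, noisy_r, frame_len, max_lag=10)
dist = ccf_distance(ccf_clean, ccf_noisy)
```

With NumPy's correlation convention, a delay of $d$ samples in the second argument produces a peak at lag $-d$, so the peak location tracks the interchannel delay; a smaller distance to the clean map indicates better preservation of that structure.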

VII. CONCLUSION
In this paper, we investigated the binaural noise-reduction problem based on the use of microphone arrays. We adopted the WL filtering framework in which both the multiple inputs and binaural outputs were merged into complex signals, which were subsequently transformed into the STFT domain to achieve binaural noise reduction. The noncircularity property of the complex signals and the interband relationship were then exploited, and a WL multichannel Wiener filter was developed. Experiments showed that this WL Wiener filter not only enhanced the noisy speech dramatically, but also recovered the spatial information of the clean speech source. In comparison with a method developed recently, the WL Wiener filter derived in this work yielded a higher output SNR, a larger PESQ score, a smaller speech distortion index, and better preservation of the source spatial information. It was also observed that the developed WL Wiener filter requires only half the number of microphones to obtain performance similar to that of the recently developed method.