Speech enhancement using modulation-domain Kalman filtering with active speech level normalized log-spectrum global priors

We describe a single-channel speech enhancement algorithm based on modulation-domain Kalman filtering that tracks the inter-frame time evolution of the speech log-power spectrum in combination with the long-term average speech log-spectrum. Offline-trained log-power spectrum global priors are incorporated in the Kalman filter prediction and update steps to improve noise suppression. In particular, we train and use Gaussian mixture model priors for speech in the log-spectral domain that are normalized with respect to the active speech level. The Kalman filter update step uses the log-power spectrum global priors together with the local priors obtained from the Kalman filter prediction step. The log-spectrum Kalman filtering algorithm, which uses the theoretical phase factor distribution and improves the modeling of the modulation features, is evaluated in terms of speech quality. Different algorithm configurations, depending on whether global priors and/or Kalman filter noise tracking are used, are compared for various noise types.


I. INTRODUCTION
Speech enhancement algorithms can benefit from including a model of the temporal, inter-frame correlation of speech. As argued in [1] [2] and [3], assuming independence between frames is unrealistic, and this assumption can be relaxed by imposing temporal structure on the speech model. Inter-frame speech correlation can be modeled with a Kalman filter (KF) that has a low-order state, as in [4], [5] and [6]. The modulation-domain KF models the short-term time dependencies between successive frames [4] [7].
Existing KF enhancement algorithms that work in the time-frequency domain differ in their choice of the KF state, the KF prediction and the KF update. The KF state can be in the speech amplitude spectral domain [4] [8], the power spectral domain or the log-power spectral domain [9]. Speech spectra are well modelled by Gaussian distributions in the log-power domain (and less well in other domains), and the mean squared error in the log-power domain is a good measure of perceptual speech quality. In addition, the log-power domain is the most suitable for infinite-support Gaussian modeling, because log-powers, unlike amplitudes and powers, can take any real value. Regarding the KF prediction, autoregressive (AR) modeling with or without the AR mean can be performed based on the autocorrelation method or the covariance method [10], with or without allowing unstable AR poles.
The KF update is affected by the signal model that is used for the addition of speech and noise [11]. If noise and speech are independent, then they add in the complex short-time Fourier transform (STFT) domain [12] [13]; it may, however, be analytically simpler to assume that they add either in the power domain or in the amplitude domain [4] [8]. These alternatives are related to the phase factor, which is the cosine of the phase difference between speech and noise [12] [14]. We can: (a) assume speech and noise additivity in the power spectral domain, which corresponds to a phase factor equal to zero, or (b) assume additivity in the amplitude spectral domain, which corresponds to a phase factor equal to unity. In [4] and [8], (b) is used, assuming that speech and noise are Gaussian in the amplitude spectral domain. Assuming additivity in the amplitude domain, as in (b), results in noise oversubtraction in the region of 0 dB SNR, which may sometimes be perceptually beneficial [15].
Modulation-domain KF algorithms should be able to distinguish between speech and noise, and global speech priors provide a mechanism for doing so. Log-spectrum global priors have been used, for example, in denoising nonnegative matrix factorization (NMF) [16] and in logNMF [17]. Speech enhancement can be performed using global priors because a long-term average speech spectrum (LTASS) model exists for speech [18]. By using the long-term average speech log-spectrum, we enhance speech log-spectrum tracking. In this paper, we advance modulation-domain Kalman filtering by utilizing multiple parallel KF updates that use log-spectrum Gaussian mixture model (GMM) priors. In [9], we presented a KF-based enhancer that used the log-power spectrum as the KF state and speech-noise additivity in the complex STFT domain as the signal model. In this paper, we extend the KF-based enhancer in [9] to include a GMM speech prior.

II. THE SPEECH ENHANCEMENT ALGORITHM
The flowchart of the algorithm is shown in Fig. 1. The first step is to perform the STFT, then to estimate the active speech level (ASL) [19] [20] and perform ASL normalization. The advantage of ASL normalization is that it permits the use of offline-trained GMM priors that model the distribution of the speech log-spectrum. With ASL normalization, the speech models do not depend on the speech power. The next step is Kalman filtering in the log-spectral domain.
In Fig. 1, the blocks in the dotted rectangle constitute the KF. The KF state is the speech log-spectrum and is of dimension $p$. The KF observation is the noisy speech log-power spectrum $y$. The algorithm's final step is to keep the first element of the KF state, which is the estimated clean speech in the ASL-normalized log-power spectral domain, transform it to the amplitude domain, denormalize it using the ASL estimate and then reconstruct the clean speech signal using the inverse STFT (ISTFT) and the noisy STFT phase.
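The final step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `reconstruct_stft` is hypothetical, and it assumes that the ASL estimate is available as an amplitude-domain scale factor `asl_amplitude`, so that denormalization is a multiplication. The ISTFT itself is omitted.

```python
import numpy as np

def reconstruct_stft(s_hat_log_power, asl_amplitude, noisy_phase):
    """Map ASL-normalized log-power estimates back to complex STFT values.

    s_hat_log_power : estimated clean speech log-power, s = 2 log(amplitude)
    asl_amplitude   : ASL estimate as an amplitude scale factor (assumption)
    noisy_phase     : phase theta of the noisy STFT, in radians
    """
    # s = 2 log(amplitude), hence amplitude = exp(s / 2).
    amplitude = np.exp(0.5 * np.asarray(s_hat_log_power, dtype=float))
    # Undo the ASL normalization (assumed here to be a division by asl_amplitude).
    amplitude_d = asl_amplitude * amplitude
    # Combine the enhanced amplitude with the noisy phase.
    return amplitude_d * np.exp(1j * np.asarray(noisy_phase, dtype=float))
```

The returned array would then be passed to an ISTFT with the same frame length and increment as the analysis STFT.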

A. Notation and the speech-noise signal model
We assume that in the complex STFT domain, the noisy speech is given by $\bar{y}_d e^{j\theta} = \bar{s}_d e^{j\phi} + \bar{n}_d e^{j\psi}$. The amplitudes of the noisy speech, speech and noise are respectively $\bar{y}_d$, $\bar{s}_d$ and $\bar{n}_d$. The subscript "d" denotes that the term is not ASL-normalized. The noisy speech phase is $\theta$, the speech phase is $\phi$ and the noise phase is $\psi$. The ASL-normalized spectral amplitudes of the noisy speech, speech and noise are respectively $\bar{y}$, $\bar{s}$ and $\bar{n}$; they are obtained from the unnormalized amplitudes by scaling with the ASL estimate. The log-powers of the noisy speech, clean speech and noise are respectively denoted by $y = 2 \log \bar{y}$, $s = 2 \log \bar{s}$ and $n = 2 \log \bar{n}$. Within the KF algorithm, we only include the frame index, $t$, as a subscript in equations involving multiple time frames.

B. The speech KF state and the speech KF prediction
We model the speech time correlation in the speech log-spectrum using the KF prediction step. The speech KF state is the ASL-normalized speech log-power spectrum. Figure 2 shows the speech KF state before and after the KF prediction and update. We use the linear KF prediction equations in (1). In (1), $x_t$ is the speech KF state, which contains the current and the past $(p-1)$ speech spectral log-powers, $A_t$ is the KF transition matrix, $Q_t$ is the KF transition noise covariance matrix and $w_t$ is the KF transition noise, which is Gaussian and zero-mean with covariance matrix $Q_t$. The transition matrix $A_t$ is obtained from AR modeling; the AR order $p$ defines the dimensions of the matrices in the KF prediction. The speech AR parameters are $a_t \in \mathbb{R}^p$ and $q_0$ is the AR modeling error variance.
We use a time-varying KF: the transition matrix $A_t$ and the transition noise covariance matrix $Q_t$ depend on AR modeling using the covariance method [10] on the pre-cleaned modulation frame, estimating both the AR coefficients and the AR mean of clean speech. The AR mean is the average clean speech log-power that is estimated as an AR parameter.
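To make the link between AR modeling and the KF prediction concrete, the sketch below fits an AR model with a mean by unwindowed least squares (in the spirit of the covariance method), builds companion-form matrices, and performs one prediction. It is an illustration under stated assumptions rather than the paper's exact equations: the function names, the sign convention of the AR coefficients, the companion form of the transition matrix, and the mean-centered prediction are all assumptions.

```python
import numpy as np

def fit_ar_with_mean(x, p):
    """Least-squares AR(p) fit with an intercept (no windowing, poles unconstrained)."""
    x = np.asarray(x, dtype=float)
    # The row for time t holds [1, x[t-1], ..., x[t-p]].
    X = np.array([np.concatenate(([1.0], x[t - p:t][::-1])) for t in range(p, len(x))])
    target = x[p:]
    theta, *_ = np.linalg.lstsq(X, target, rcond=None)
    c, a = theta[0], theta[1:]
    q0 = np.mean((target - X @ theta) ** 2)   # AR modeling error variance
    mean = c / (1.0 - a.sum())                # AR mean implied by the intercept
    return a, mean, q0

def companion_matrices(a, q0):
    """Transition matrix and transition noise covariance for state [s_t, ..., s_{t-p+1}]."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a                    # first row predicts the current element
    A[1:, :-1] = np.eye(p - 1)     # remaining rows shift the state by one frame
    Q = np.zeros((p, p))
    Q[0, 0] = q0                   # only the current element receives new noise
    return A, Q

def kf_predict(mu, P, A, Q, mean):
    """Linear KF prediction of the state mean and covariance around the AR mean."""
    mu_pred = mean + A @ (mu - mean)
    P_pred = A @ P @ A.T + Q
    return mu_pred, P_pred
```

With p = 2, as used in the experiments, `A` is a 2 x 2 matrix whose first row holds the two AR coefficients. The fit does not constrain the AR poles, so a frame with `a.sum()` close to one would need separate handling.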
Figure 1. The flowchart diagram of the algorithm. The term $z^{-1}$ refers to a one-frame delay. The blocks in the dotted rectangle constitute the KF.
Figure 2. The speech KF is shown. We focus on the speech KF state. We expand to $h$ weighted Gaussians based on our Gaussian splitting algorithm using offline-trained log-spectrum global priors, as described in Sec. II.C. Block labels in the figure: noisy $y$ and noise estimate; KF update; Expand: Gaussian to GMM splitting based on prior modeling (Sec. II.C), with $\boldsymbol{\mu}_{pr}$, $P_{pr}$, $w_{pr}$; Collapse to single Gaussian.
In (1) and Fig. 2, the speech KF state mean is denoted by $\boldsymbol{\mu}_t$ and the speech KF state covariance matrix is denoted by $P_t$.

C. The log-spectrum global speech priors
Based on Figs. 1-2, we perform multiple speech-noise KF updates because a GMM of $h$ mixtures is used as the global speech prior. We use the global priors together with the KF-based local priors, through a Gaussian splitting algorithm that is based on ASL-normalized offline-trained priors. We multiply the current element of the decorrelated KF state with the global priors; decorrelating and then recorrelating the KF state preserves the inter-frame modeling of the KF prediction. We first decompose the speech KF state covariance matrix $P$ as in (2), where $g_0$ is the variance of the current element of the speech KF state. We then define the linear transformation matrix $B$ as in (3) [5]. The next step is to compute the linearly transformed speech KF state $B x_t$, with mean $B \boldsymbol{\mu}$ and with the covariance matrix given in (4) [5]. In (4), $g_0$ is preserved. The decorrelated speech KF state is $B x_t$, so that the current speech log-power $s_t$ is uncorrelated with $(s_{t-1} \ldots s_{t-p+1})^T$. The speech GMM priors are multiplied with the current element of the decorrelated speech KF state after the KF prediction. After the multiple parallel KF updates, we recorrelate the KF state by using $B^{-1}$ and the inverse of the linear transformation in (4).
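One linear transformation with the properties described above (the current element and its variance $g_0$ are left unchanged, and the past elements become uncorrelated with it) can be sketched as follows. The specific form of `B` used here is an assumption chosen to match that description; the definition in [5] that the paper uses may differ, and the function names are hypothetical.

```python
import numpy as np

def decorrelating_transform(P):
    """Return B such that B P B^T has no correlation between element 0 and the rest."""
    p = P.shape[0]
    g0 = P[0, 0]        # variance of the current element
    g = P[1:, 0]        # covariance of the past elements with the current one
    B = np.eye(p)
    B[1:, 0] = -g / g0  # remove from each past element its dependence on s_t
    return B

def decorrelate(mu, P):
    """Transform the state mean and covariance; the first element is unchanged."""
    B = decorrelating_transform(P)
    return B @ mu, B @ P @ B.T, B

def recorrelate(mu_dec, P_dec, B):
    """Undo the transformation after the update, using the inverse of B."""
    B_inv = np.linalg.inv(B)
    return B_inv @ mu_dec, B_inv @ P_dec @ B_inv.T
```

Because the first row of `B` is a unit vector, the first element of the transformed state is still the current log-power $s_t$, so a scalar prior can be applied to it directly.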
In Fig. 2, we compute the posterior weights $w^{+}_{i,t}$ for $i \in \{1, \ldots, h\}$ after each of the multiple KF updates. Finding $w^{+}_{i,t}$ involves the use of the GMM KF update [21], which in turn involves the use of the nonlinear KF observation model.
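The "expand" and "collapse" operations in Fig. 2 can be illustrated for a scalar log-power with the standard Gaussian product and moment-matching identities. This sketch covers only the multiplication of the local Gaussian prior with the global GMM prior and the collapse of a weighted set of Gaussians; it does not include the nonlinear observation update that produces the posterior weights, and the function names are hypothetical.

```python
import numpy as np

def expand_with_gmm_prior(m, v, prior_w, prior_m, prior_v):
    """Multiply N(m, v) with a scalar GMM prior; return weights, means, variances."""
    prior_w, prior_m, prior_v = (
        np.asarray(z, dtype=float) for z in (prior_w, prior_m, prior_v)
    )
    # Product of two Gaussians: precisions add, means are precision-weighted.
    out_v = 1.0 / (1.0 / v + 1.0 / prior_v)
    out_m = out_v * (m / v + prior_m / prior_v)
    # Each weight is scaled by how well the local Gaussian agrees with the component.
    s2 = v + prior_v
    log_w = np.log(prior_w) - 0.5 * np.log(2.0 * np.pi * s2) - 0.5 * (m - prior_m) ** 2 / s2
    w = np.exp(log_w - log_w.max())
    return w / w.sum(), out_m, out_v

def collapse_to_gaussian(w, m, v):
    """Moment-match a weighted set of scalar Gaussians with a single Gaussian."""
    mean = np.sum(w * m)
    var = np.sum(w * (v + (m - mean) ** 2))
    return mean, var
```

With h = 4 mixtures, as in the experiments, `expand_with_gmm_prior` would return four weighted Gaussians per frequency bin and frame.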

D. The phase-factor-sensitive modified KF update
The KF update estimates the posterior of the speech and noise log-powers given the noisy log-power. The KF update is described in more detail in [9]. It combines the Gaussian speech and noise priors from the KF prediction with the distribution of the STFT phase difference between speech and noise, expressed through the phase factor $\alpha = \cos(\phi - \psi)$ [12] [13]:
$$e^{y} = e^{s} + e^{n} + 2 e^{0.5(s+n)} \alpha \qquad (5)$$
From (5), $\alpha = 0.5 \exp(y - 0.5(s+n)) - \cosh(0.5u)$, where $u = n - s$, and equivalently $y = 0.5(s+n) + \log(2(\alpha + \cosh(0.5u)))$.
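The relations derived from (5) can be checked numerically with a short sketch. The function names are hypothetical; the formulas are the ones stated in the text.

```python
import math

def noisy_log_power(s, n, alpha):
    """Noisy log-power y from speech log-power s, noise log-power n and phase factor alpha."""
    u = n - s
    return 0.5 * (s + n) + math.log(2.0 * (alpha + math.cosh(0.5 * u)))

def phase_factor(y, s, n):
    """Phase factor alpha implied by the log-powers y, s and n."""
    u = n - s
    return 0.5 * math.exp(y - 0.5 * (s + n)) - math.cosh(0.5 * u)

s, n = 1.2, -0.4
# alpha = 0 reduces (5) to additivity in the power domain.
y_power = noisy_log_power(s, n, 0.0)
# alpha = 1 reduces (5) to additivity in the amplitude domain.
y_amplitude = noisy_log_power(s, n, 1.0)
```

At 0 dB SNR (s equal to n), the amplitude-additive case gives a noisy power that is twice the power-additive one, which is the regime where the two signal models differ most.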
$$\ldots \; s^{a} n^{b} \, p(s, n) \, du \, d\alpha \qquad (7)$$
In (7), the integration over $u$ is performed with truncated Gaussians and straight line segments, obtaining a closed-form solution. The integration over $\alpha$ is done using $R$ sigma points, as in [9], utilizing the unscented transform [22] [23]. The integration over $\alpha$ with sigma points requires the moments $E\{\alpha^{z}\} = 2^{-z} \, z! \, ((0.5z)!)^{-2}$ for even $z$, and $E\{\alpha^{z}\} = 0$ otherwise.
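The moment formula above is the one obtained when $\alpha$ is the cosine of a uniformly distributed phase difference. Under that assumption, the sketch below compares it with an $R$-point rule that uses Gauss-Chebyshev nodes with equal weights. This rule is one possible choice of points that reproduces the moments up to order $2R - 1$; it is not necessarily the sigma-point set used in [9], and the function names are hypothetical.

```python
import math

def phase_factor_moment(z):
    """E{alpha^z} = 2^-z z! ((z/2)!)^-2 for even z, and zero for odd z."""
    if z % 2:
        return 0.0
    half = z // 2
    return math.factorial(z) / (2 ** z * math.factorial(half) ** 2)

def chebyshev_points(R):
    """R equally weighted points at the Gauss-Chebyshev nodes on [-1, 1]."""
    nodes = [math.cos((2 * k - 1) * math.pi / (2 * R)) for k in range(1, R + 1)]
    weights = [1.0 / R] * R
    return nodes, weights

def point_moment(z, R):
    """Approximate E{alpha^z} as a weighted sum over the R points."""
    nodes, weights = chebyshev_points(R)
    return sum(w * a ** z for a, w in zip(nodes, weights))
```

For R = 3, the points are 0 and plus or minus sqrt(3)/2, each with weight 1/3, and the moments agree up to order 5; the sixth moment is no longer matched.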

III. NOISE TRACKING AND THE SPEECH-NOISE KF
We now present the noise KF state, the noise KF prediction [24] and the speech-noise KF prediction. With noise tracking, the $(s, n)$ priors are correlated and the joint KF state in $\mathbb{R}^{p+q}$ comprises the speech KF state in $\mathbb{R}^{p}$ and the noise KF state in $\mathbb{R}^{q}$.
We perform noise tracking based on AR($q$) modeling and on the estimated SNR in the modulation frame [6] [9]. After the noise KF prediction, we decorrelate the noise KF state and then multiply the noise log-power Gaussian with the Gaussian that is obtained from external noise estimation and log-normal noise power modeling [25] [26]. The noise KF prediction, denoted by (n), has the same form as the speech KF prediction in (1) and is given in (8). The joint, (j), speech-noise KF state $z_t$ is defined in (9). We use full covariance matrices because the KF update in Sec. II.D makes the speech and noise estimates correlated.
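A joint speech-noise prediction with a full covariance matrix can be sketched by stacking the two states and using block-diagonal transition matrices. This is an assumption about the structure of the joint model, made for illustration: the function name is hypothetical, the speech and noise transition noises are taken to be independent, and the AR means are omitted for brevity.

```python
import numpy as np

def joint_predict(mu, P, A_s, Q_s, A_n, Q_n):
    """Predict the stacked [speech; noise] state while keeping cross-covariances.

    mu : joint state mean of length p + q
    P  : full (p + q) x (p + q) joint covariance matrix
    """
    p, q = A_s.shape[0], A_n.shape[0]
    # Block-diagonal transition: speech and noise evolve with their own AR models.
    A = np.block([[A_s, np.zeros((p, q))], [np.zeros((q, p)), A_n]])
    # Independent transition noises (assumption), hence a block-diagonal Q.
    Q = np.block([[Q_s, np.zeros((p, q))], [np.zeros((q, p)), Q_n]])
    return A @ mu, A @ P @ A.T + Q
```

Even with block-diagonal `A` and `Q`, any speech-noise correlation already present in `P` is carried through the prediction, which is why the covariance matrix has to be stored in full.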

IV. IMPLEMENTATION, RESULTS AND EVALUATION
We use acoustic frames of length 32 ms, modulation frames of length 32 ms or 64 ms and a 4 ms acoustic and modulation frame increment. We use the TIMIT database [27] sampled at 16 kHz. For the training of the global speech priors, we use 250 sentences and for testing, we use 40 sentences. We use noise types from the noise database in [28] at SNR levels from −20 dB to 30 dB. Random segments of noise from the noise signals are used [29]. The external noise estimation is based on [30] [29]. For pre-cleaning in Fig. 1, we use the traditional log-MMSE approach [31] [29]. In Secs. II.B and III, we use p = 2 and q = 2. In Secs. II.C-D, h = 4 and R = 3.
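The frame settings above translate into sample and frame counts as in the sketch below. The constant and function names are hypothetical, and counting the modulation frame length in acoustic frames as its duration divided by the 4 ms increment is an assumption about how the durations map to frames.

```python
FS_HZ = 16_000  # TIMIT sampling rate used in the experiments

def ms_to_samples(duration_ms, fs_hz=FS_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * fs_hz / 1000))

ACOUSTIC_FRAME = ms_to_samples(32)   # acoustic frame length in samples
FRAME_INCREMENT = ms_to_samples(4)   # acoustic frame increment in samples
OVERLAP = 1 - FRAME_INCREMENT / ACOUSTIC_FRAME

def modulation_frame_length(duration_ms, increment_ms=4):
    """Number of acoustic frames spanned by one modulation frame (assumption)."""
    return int(round(duration_ms / increment_ms))
```

Under these assumptions, the 32 ms and 64 ms modulation frames span 8 and 16 acoustic frames respectively, which gives a sense of how few samples are available for the AR(2) fit in each modulation frame.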
For evaluation purposes, we compare the results with and without global speech priors, and with and without noise KF tracking. We consider alternative configurations of the algorithm in Figs. 1-2. Table I shows the Bark Spectral Distortion (BSD) [32] for babble noise at 15 dB SNR. We compute the BSD without voice activity detection. In Table I, the BSD of the noisy speech signal is $2.64 \times 10^{-2}$ dB.
Table I notation (partially recovered): ... Fig. 2 and log-spectrum speech priors. EE = Early expanding using the log-spectrum speech priors before the KF prediction. EE assumes GP and EE changes Fig. 2.
Based on Table I, the ST algorithm that does not perform KF noise tracking has a higher BSD than the global priors ST (GPST). This means that the offline-trained priors aid speech tracking; using global speech priors reduces the BSD.
In Table I, we also consider early expanding (EE). Figure 2 performs late expanding, since the global speech priors are used after the KF prediction. In contrast, with EE, the global priors are used before the KF prediction: the Gaussian-GMM multiplication is performed before the KF prediction. Comparing GPST with EEST in Table I, we note that EE reduces the BSD.
In Table I, using smaller modulation frames (SM) reduces the BSD. The tradeoff is between noisier AR modeling and a modulation frame that is more concentrated in time.
We now use noise tracking and global priors (NTGP). In Table I, we observe a decreasing error from ST to GPST and to NTGPST. With the global priors, as presented in Fig. 2, the BSD error is $0.90 \times 10^{-2}$ dB at 15 dB SNR babble noise.
We now examine babble noise at 5 dB SNR. Like Table I, Table II shows the BSD, using the same algorithm notation. We see a decreasing error from ST to GPST and to NTGPST. The noisy speech BSD is $1.83 \times 10^{-1}$ dB.
... [33]. The cepstrum is directly related to the minimization of the log-power error that we want to achieve with log-spectrum Kalman filtering.
Tables I-II examine the alternative configurations of the proposed algorithm for specific SNRs. In contrast, Figs. 3-4 compare the alternative configurations of the proposed algorithm with traditional speech enhancement techniques in the SNR range of −20 dB to 30 dB. For comparison purposes, we denote the traditional MMSE approach [34] as TMMSE and the traditional log-MMSE approach [31] as TLMMSE.
We use the speech distortion SIG, noise distortion BAK and overall quality OVRL metrics from [35] [15], which are on a scale of 1 to 5, where 5 indicates excellent speech quality. Figures 5-6 illustrate the ∆SIG and ∆OVRL for babble noise at 15 dB and 5 dB SNR. As a specific case, in 15 dB babble noise, ST has a ∆OVRL score of 0.53.
In Figs. 7-10, we use the PESQ speech quality metric for babble, white, F16 aircraft and factory noises. In the SNR range of 0 dB to 30 dB, the presented KF-based algorithms outperform the traditional noise suppression techniques, TLMMSE and TMMSE. We also observe that the presented algorithm performs best when both noise KF tracking and global speech priors are used. As in Figs. 7-10, the ST algorithm is also evaluated in [9] with PESQ.

V. CONCLUSION
In this paper, we present a single-channel speech enhancement algorithm based on modulation-domain Kalman filtering that tracks the time evolution of the speech log-power spectrum in every frequency bin using the long-term average speech log-spectrum. The noise suppression algorithm applies a KF that uses offline-trained log-spectrum priors, in the form of Gaussian mixture models, that are normalized with respect to the active speech level. The KF update uses the phase factor between speech and noise. The KF algorithm is evaluated in terms of speech quality and different algorithm configurations are compared.