ACOUSTIC FEEDBACK CANCELLATION FOR LONG ACOUSTIC PATHS USING A NONSTATIONARY SOURCE MODEL

Several proactive acoustic feedback (Larsen effect) cancellation schemes have been presented for speech applications with short acoustic feedback paths as encountered in hearing aids, but these schemes fail with the long impulse responses inherent to public address systems. We derive a new prediction error method (PEM) based scheme (referred to as PEM-AFROW) which identifies both the acoustic feedback path and the nonstationary speech source model. A cascade of a short term and a long term predictor removes the coloring and periodicity in voiced speech segments, which account for the unwanted correlation between the loudspeaker signal and the speech source signal. The predictors calculate row operations which are applied to pre-whiten a least squares system, which is then solved recursively by means of e.g. NLMS or RLS algorithms. Simulations show that this approach is superior to earlier approaches whenever long acoustic channels are dealt with.


INTRODUCTION
Acoustic feedback, also referred to as the Larsen effect (howling), occurs in microphone-amplifier-loudspeaker-room systems when the loop gain is larger than one at a frequency where the loop phase is a multiple of 2π.
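The howling condition can be checked numerically for a given feedback path. The following is a minimal sketch, assuming an FIR feedback path and a frequency-independent gain g; the function names, the frequency grid and the phase tolerance are illustrative choices, not part of the scheme described in this paper.

```python
import cmath
import math

def loop_response(g, f, omega):
    """Open-loop response g * F(e^{j omega}) for FIR feedback path taps f."""
    return g * sum(tap * cmath.exp(-1j * omega * n) for n, tap in enumerate(f))

def howling_frequencies(g, f, n_grid=4096, phase_tol=0.02):
    """Grid frequencies (rad/sample) in [0, pi] where the loop gain exceeds
    one and the loop phase is within phase_tol of a multiple of 2*pi."""
    hits = []
    for i in range(n_grid + 1):
        omega = math.pi * i / n_grid
        r = loop_response(g, f, omega)
        # cmath.phase wraps to (-pi, pi], so multiples of 2*pi map to ~0
        if abs(r) > 1.0 and abs(cmath.phase(r)) < phase_tol:
            hits.append(omega)
    return hits

# Toy path (illustrative values): pure delay of 8 samples, attenuation 0.5
f_toy = [0.0] * 8 + [0.5]
unstable = howling_frequencies(3.0, f_toy)   # loop gain 1.5
stable = howling_frequencies(1.5, f_toy)     # loop gain 0.75
```

For the toy path the loop gain is the same at all frequencies, so only the amplifier gain decides whether any candidate frequency is found.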
A conventional solution consists of inserting notch filters into the signal path, thus decreasing the loop gain at those frequencies for which the problem arises. There are several disadvantages to this approach: the system is reactive (the howling phenomenon occurs for about 0.5 seconds before it is detected), the desired signal is distorted by the notch filters, and the 'reverb-like' sound which occurs in a system which is marginally stable is not suppressed.
In this paper, we will focus on single channel acoustic feedback cancellation (AFC) schemes as depicted in Figure 1. This setup does not exhibit the disadvantages summarized above. The acoustic path from the loudspeaker to the microphone is $F(q,k) = \mathbf{f}(k)^T \mathbf{q} = f_0 q^0 + \dots + f_{N-1} q^{-N+1}$, where $q^{-1}$ is the delay operator, and the estimate of its filter coefficient vector $\mathbf{f}(k)$ is $\hat{\mathbf{f}}(k)$. The N coefficients of $\hat{\mathbf{f}}(k)$ are copied at regular time instants to the cancellation filter $\mathbf{f}_0(k)$. The loudspeaker signal $u(k)$ is filtered by the room impulse response $\mathbf{f}(k)$ and also by the cancellation filter $\mathbf{f}_0(k)$. The difference between the microphone signal and the cancellation filter output is the error signal $e(k)$, which should then be equal to the speech source signal $v(k)$ (for a correct model $\hat{\mathbf{f}}(k)$). In Figure 1, $g$ is the amplifier gain, $y(k)$ is the microphone signal, $u(k) = g\,e(k)$ is the loudspeaker signal, $\mathbf{f}$ is the feedback path impulse response, $v(k)$ is the (speech) source signal, $w(k)$ is the excitation sequence of the source signal, and $H(q,k) = (1 + a_1(k) q^{-1} + \dots + a_P(k) q^{-P})^{-1}$ is a time varying autoregressive (AR) speech model of order P. The coefficient vector of the denominator polynomial is $\mathbf{a}(k)$. Finally, the $q^{-D}$ block in Figure 1 is a forward delay, which is often unavoidable in digital implementations (buffers for AD/DA converters, ...), but which will be exploited further on.
An acoustic echo cancellation (AEC) like approach has been used for AFC in e.g. [1,2]. The main complication in AFC compared to the direct identification approach used in AEC is that in AFC, one cannot assume that the speech source signal $v(k)$ is uncorrelated with the loudspeaker signal $u(k)$. Ignoring this, and applying direct identification anyway, would result in a bias in the identified room impulse response [3,4]. This bias can be removed using the prediction error method (PEM), which incorporates a speech source signal model into the identification procedure [5]. This has been studied mainly in the hearing aids context, where the feedback path impulse response is shorter than 5 msec. In this paper, we focus on public address (PA) systems, where the feedback path typically has a much longer impulse response, e.g. up to 500 msec, and hence an alternative approach will be needed.
Figure 1: Acoustic feedback cancellation scheme.
Speech, although highly nonstationary over longer time periods, is often considered to be stationary during short frames of ca. 20 msec (e.g. 160 samples at 8 kHz). Within these frames, it can be whitened by a cascade of a short term predictor (STP) and a long term predictor (LTP). Estimating the room impulse response, however, requires data windows of several seconds, over the length of which the speech signal will be nonstationary. This contrast between the long stationarity period of the long room impulse response and the short stationarity period of the short term prediction speech model, and the corresponding number of data points which are available to identify each of them, is fundamental to the problem of acoustic feedback cancellation for public address systems.
In this paper, we introduce a new technique which estimates the speech model over short time windows (over which it is stationary), and the room impulse response model over longer time windows (which is necessary because the number of parameters is much larger). The speech model is not required to be stationary over the complete window on which the room impulse response is identified. Our scheme will also include a long term predictor which models the periodicity in $w(k)$. We will show that this scheme outperforms existing methods.
This paper is organized as follows. In section 2, we introduce our new procedure. It uses alternating updates of the speech model and the adaptive filter which models the room. An important difference with [5] is that in our algorithm the speech model provides row transformations, which are then applied to pre-whiten the least squares system from which the room response estimate is computed. Hence the name 'prediction error method based adaptive filtering with row operations' (PEM-AFROW). In section 3, complexity figures are given, in section 4 we show simulation results, and section 5 contains the conclusion of the paper.

PEM-AFROW
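It is instructive to first assume that $w(k)$, the excitation sequence, is a white noise sequence. This means that we model the speech source signal $v(k)$ as a time varying AR (TVAR) signal of order P, i.e. as the output of the all-pole model $H(q,k)$ driven by $w(k)$. In this model, $b(k)$ accounts for energy variations in the excitation signal. Later on, we will use a more general model for $w(k)$. We start from the least squares minimization problem (1) for $\hat{\mathbf{f}}(k)$, whose residual is $U(k)\hat{\mathbf{f}}(k) - \mathbf{y}(k)$, with $U(k)$ the loudspeaker data matrix and $\mathbf{y}(k)$ the vector of microphone samples. If we have an estimate $\hat{A}(q,k)$ of $H^{-1}(q,k)$ available at each time instant, with coefficients $\hat{\mathbf{a}}(k) \in R^P$, we can apply a pre-whitening by forming the matrix
$$\hat{A}(k) = \begin{pmatrix} \hat{\mathbf{a}}^T(k) & 0 & 0 & \cdots \\ 0 & \hat{\mathbf{a}}^T(k-1) & 0 & \cdots \\ 0 & 0 & \hat{\mathbf{a}}^T(k-2) & \cdots \\ \vdots & & & \ddots \end{pmatrix}.$$
It is important to note that each row in the matrix is shifted over one position compared to the previous row, hence the second row has one zero in front of the transposed vector $\hat{\mathbf{a}}^T(k-1)$ of dimension P + 1, the third row has two zeros, and so on. We can now modify the minimisation problem (1) to the pre-whitened problem (2), in which the residual becomes $\hat{A}(k)U(k)\hat{\mathbf{f}}(k) - \hat{A}(k)\mathbf{y}(k)$. We now introduce the assumption that the speech model $h(k)$ is constant during frames of 20 msec. This means that we can write
$$\hat{\mathbf{a}}(k) = \hat{\mathbf{a}}_i \qquad (3)$$
with e.g. L = 160 for a sampling rate of 8 kHz, and $i = \lceil k/L \rceil$, the smallest integer not smaller than $k/L$. This means that i is the frame index.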
We now decouple the non-linear equations in order to calculate the estimates $\hat{\mathbf{a}}_i$ and the room impulse response estimate $\hat{\mathbf{f}}(k)$ in an alternating fashion. In the first step, a previous estimate $\hat{\mathbf{f}}(k)$ is used to filter a frame of data (20 msec). The filter output is subtracted from the corresponding microphone samples, resulting in
$$d(k) = y(k) - \hat{\mathbf{f}}^T(k)\,\mathbf{u}(k).$$
Linear prediction (Levinson-Durbin algorithm) is then performed on this $d(k)$ to find the linear prediction error filter $\hat{\mathbf{a}}_i$. In the second step, (2) is solved for $\hat{\mathbf{f}}(k)$ with the updated (fixed) value for $\hat{\mathbf{a}}_i$. This gives a better estimate $\hat{\mathbf{f}}(k)$ for $\mathbf{f}(k)$. These two steps can be iterated on the frame. Since neither of these two steps will increase $\varepsilon\{e(k)\} = \varepsilon\{\hat{A}(k)U(k)\hat{\mathbf{f}}(k) - \hat{A}(k)\mathbf{y}(k)\}$, the algorithm will converge to a (possibly local) minimum of (2).
Figure 2: PEM-AFROW identification. In the first phase, $\hat{\mathbf{a}}$ is estimated in the left hand side; it is then copied to the right hand side, where the estimation of $\hat{\mathbf{f}}$ is performed on the same data frame. Finally, $\hat{\mathbf{f}}$ is copied to the left hand side and used in the next frame.
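The first step of the alternation (subtract the current feedback estimate, then fit a linear predictor to the result) can be sketched as follows. This is a minimal illustration, assuming a biased autocorrelation estimate and the textbook Levinson-Durbin recursion; the function names and the list-based data layout are hypothetical and not taken from the authors' implementation.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[t] * x[t - lag] for t in range(lag, n)) / n
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Prediction error filter a = [1, a_1, ..., a_order] from the
    autocorrelations r[0..order]. Returns (a, residual_energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame: keep lower-order filter
        acc = sum(a[j] * r[m - j] for j in range(m))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= (1.0 - k * k)
    return a, err

def estimate_speech_model(y_frame, u_vectors, f_hat, order):
    """Form d(k) = y(k) - f_hat^T u(k) over one frame, then fit the
    prediction error filter to d with Levinson-Durbin."""
    d = [y - sum(f * u for f, u in zip(f_hat, uv))
         for y, uv in zip(y_frame, u_vectors)]
    return levinson_durbin(autocorrelation(d, order), order)[0]
```

The returned filter includes the leading coefficient 1, so it has order + 1 entries; whether the leading 1 is stored explicitly is a design choice made here for convenience.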
In order to reduce the complexity, we will perform only one iteration per frame. The minimization problem (2), with a fixed value of $\hat{A}(k)$, can be solved for $\hat{\mathbf{f}}(k)$ by means of any adaptive filtering algorithm. We have implemented this using both a QRD-based RLS algorithm and an NLMS algorithm. The input vector is in both cases the pre-whitened vector
$$\mathbf{u}_w^T(k) = \hat{\mathbf{a}}_i^T \begin{pmatrix} \mathbf{u}^T(k) \\ \mathbf{u}^T(k-1) \\ \vdots \end{pmatrix} \qquad (4)$$
Here $\mathbf{u}(k) = (u(k) \dots u(k-N+1))^T$. The desired signal input (right hand side sample) is $\hat{\mathbf{a}}_i^T \mathbf{y}(k)$ with $\mathbf{y}(k) = (y(k) \dots y(k-P+1))^T$. Since $\mathbf{u}(k)$ is a shifted version of $\mathbf{u}(k-1)$ with one sample prepended, and $\hat{\mathbf{a}}_i$ remains constant during a frame of L samples, $\mathbf{u}_w(k)$ will be a shifted version of $\mathbf{u}_w(k-1)$ with one sample prepended. So inside a frame, only one vector multiplication has to be performed to calculate $\mathbf{u}_w(k)$. On the other hand, at the start of each frame, a matrix multiplication should be performed to calculate all elements of $\mathbf{u}_w(iL)$, i.e. (4) is evaluated in full for $k = iL$.
The identification algorithm is shown in Figure 2. For real time implementation, the scheme involves a delay of one frame for the update of $\hat{\mathbf{f}}(k)$, since $\hat{\mathbf{a}}_i$ can only be calculated at time $iL$. Note that this is not a problem since we have assumed that the room impulse response is constant over more than one frame. The delay is effectively implemented as a delay line for the input samples $u(k)$ before they are fed to equation (4).
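The shift structure of the pre-whitened input vector and the adaptive update can be sketched as follows. This is a minimal illustration, assuming the prediction error filter is stored as a = [1, a_1, ..., a_P] and the loudspeaker samples are kept in a plain list; the function names, the regularisation constant eps and the textbook NLMS form are assumptions, not the authors' implementation.

```python
def prewhitened_input(u, k, n_taps, a):
    """Full computation of u_w(k): element n is sum_p a[p] * u[k - n - p].
    Requires k >= n_taps + len(a) - 2 so that no index becomes negative."""
    return [sum(a[p] * u[k - n - p] for p in range(len(a)))
            for n in range(n_taps)]

def shift_update(u_w_prev, u, k, a):
    """Inside a frame: compute one new filtered sample, prepend it to
    u_w(k - 1) and drop the oldest element."""
    newest = sum(a[p] * u[k - p] for p in range(len(a)))
    return [newest] + u_w_prev[:-1]

def nlms_step(f_hat, u_w, y_w, mu, eps=1e-8):
    """One NLMS update on a pre-whitened input vector u_w and a
    pre-whitened desired sample y_w. Returns (new_f_hat, a_priori_error)."""
    err = y_w - sum(f * x for f, x in zip(f_hat, u_w))
    norm = sum(x * x for x in u_w) + eps
    return [f + mu * err * x / norm for f, x in zip(f_hat, u_w)], err
```

In this sketch `prewhitened_input` would be called at a frame border, where the filter `a` changes, and `shift_update` for every other sample of the frame, which is what keeps the per-sample cost low.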
Once the room impulse response has been identified, the next step is to insert the cancellation filter into the feedback loop scheme by setting $\mathbf{f}_0(k) = \hat{\mathbf{f}}(k)$, e.g. at regular time intervals (see Figure 1). It is important to notice that this influences the adaptation. The input data used for the identification procedure then depend on the current model estimate, which is reminiscent of a non-linear optimization problem. This dependency is effectively ignored in our implementation (it is also ignored in adaptive control theory [6]).
Experiments indicate that updating the cancellation filter regularly is beneficial to the identification process.This can be explained because a time variant forward path (from microphone to loudspeaker) decreases the correlation between the loudspeaker signal and the speech source signal.
At this point, the difference between PEM-AFC and PEM-AFROW becomes obvious: in PEM-AFROW the stationarity of the speech model is explicitly assumed in the minimization problem by stating that $\hat{\mathbf{a}}_i$ remains constant during a frame (see equation (3)). At the start of each frame, the full input vector $\mathbf{u}_w^T(k)$ is recalculated. In PEM-AFC, this assumption of stationarity is not made for the optimisation problem itself (the optimisation is decoupled into two completely independent adaptive filters), and the full input vector is never recomputed after a change of $\mathbf{a}(k)$, which can only be justified for short impulse responses.
For the TVAR signals we studied up to now (where $w(k)$ was a white noise sequence), the pre-whitening step removes all of the correlation between the loudspeaker signal and the source signal. However, the excitation sequence $w(k)$ for voiced speech is periodic (glottal excitation). Hence, due to this periodicity, the input signal $u(k)$ of the adaptive filter is still correlated with the source signal, even after pre-whitening.
A standard approach in speech coding [7] is to cascade a short term predictor (STP) of order P (e.g. 12), which models the vocal tract characteristics, with a long term predictor (LTP) with only one tap and a lag equal to the pitch period to model the periodicity,
$$u_{lsw}(k) = u_{sw}(k) + b_j\,u_{sw}(k - M_j), \qquad j = \lceil k/L_{ltp} \rceil.$$
The LTP can be estimated in windows of 20 msec (which is the frame length L of the short term predictor), with a 10 msec overlap. This means that the LTP model is estimated every 10 msec, which corresponds to $L_{ltp}$ samples (at 8 kHz, $L_{ltp} = 80$). In order to estimate the LTP, we minimize the variance $E_j$ of the long term prediction residual over the window: for a given lag, the single long term prediction filter tap $b$ is estimated and the variance of the resulting long term prediction residual is computed. This is evaluated for different values of the lag, $M_{j,i} = M_{min} \dots M_{max}$, and the parameters $(M_j, b_j)$ which result in the minimum value of $E_j$ are chosen as the predictor for long term prediction frame j.
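The lag search can be sketched as follows. This is a minimal illustration, assuming the LTP is fitted to a short term prediction residual r and that, for each candidate lag, the tap and the residual energy are given by the standard one-tap least squares expressions $b = -\sum r(k) r(k-M) / \sum r^2(k-M)$ and $E = \sum r^2(k) - (\sum r(k) r(k-M))^2 / \sum r^2(k-M)$. The function name and the window indexing are hypothetical.

```python
def estimate_ltp(r, start, length, m_min, m_max):
    """Search the lag M in [m_min, m_max] and the single tap b minimising
    E(M) = sum_k (r[k] + b * r[k - M])**2 for k in [start, start + length).
    Requires start >= m_max so that r[k - M] is always available.
    Returns (best_lag, best_tap, best_energy)."""
    window = range(start, start + length)
    energy = sum(r[k] * r[k] for k in window)
    best = (None, 0.0, energy)  # fall back to "no LTP" if nothing improves
    for lag in range(m_min, m_max + 1):
        cross = sum(r[k] * r[k - lag] for k in window)
        lagged = sum(r[k - lag] ** 2 for k in window)
        if lagged == 0.0:
            continue  # no energy in the lagged window: tap undefined
        b = -cross / lagged
        e = energy - cross * cross / lagged
        if e < best[2]:
            best = (lag, b, e)
    return best
```

With the sign convention of the cascade above (residual plus b times the lagged residual), a perfectly periodic residual yields a tap of -1 at the pitch lag. Multiples of the pitch period fit such a signal equally well, so the strict comparison keeps the smallest lag among ties.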
It is important to note that by applying long term prediction, the actual order of the speech source model is the lag of the long term model plus the order of the short term model, and as stated in [5], to guarantee identifiability, the forward delay must be larger than the order of this model. In practice it does not matter too much where this forward delay is implemented: often a latency D is introduced by buffering before and after the A/D and D/A converters, or even, due to the relatively low velocity of sound waves, by the distance between the loudspeaker and the microphone.
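As a quick check of this condition, one can plug in the values used in the simulation section (short term order P = 12, maximum lag $M_{max} = 160$, forward delay D = 200 samples). Bounding the order of the combined model by $M_{max} + P$ is an assumption made here for illustration:

```latex
D > M_{\max} + P
\quad\Longrightarrow\quad
200 > 160 + 12 = 172 .
```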
In section 2 it was mentioned that at frame borders, the whole input vector has to be recalculated by means of a matrix multiplication.It must be noted that when long term prediction is added to the algorithm, this matrix multiplication has to be performed not only at frame borders of the short term predictor, but also at frame borders of the long term predictor.

COMPLEXITY
The complexity is evaluated for the algorithm operated with an NLMS adaptive filter. In these complexity expressions a multiplication and an addition are counted as two separate floating point operations. A 'search range' $M_{min}$ to $M_{max}$ has to be specified for the lag of the long term predictor (typically $M_{min} = 20$, $M_{max} = 160$ at 8 kHz). The complexity depends on these parameters through $dM = M_{max} - M_{min}$. For the complexity calculation we assume one tap long term prediction, and we also assume that the frames do not overlap. Since at each frame border the full NLMS input vector is recalculated, the complexity per sample is
$$8(N + P) + 4\,dM + 5 + \frac{(2P + 4)N + 4P^2 - 5P + 15}{L}$$
floating point operations. The algorithm was implemented in C++ on a 1 GHz Pentium III PC without any specific optimization effort, and runs in real time with N = 2000, P = 12, L = 160 at 16 kHz sampling rate, with a long term prediction overlap of 80 samples. In case of no overlap for the long term predictor, the number of floating point operations per second would be $272 \cdot 10^6$.
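The quoted operation count can be reproduced from the complexity expression. The sketch below assumes that the typical search range $M_{min} = 20$, $M_{max} = 160$ was used for the quoted figure, which the text does not state explicitly; the function name is hypothetical.

```python
def pem_afrow_flops_per_sample(n_taps, order, frame_len, m_min, m_max):
    """Evaluate the per-sample complexity expression for PEM-AFROW with
    NLMS, one-tap long term prediction and non-overlapping frames."""
    d_m = m_max - m_min
    per_sample = 8 * (n_taps + order) + 4 * d_m + 5
    per_frame = (2 * order + 4) * n_taps + 4 * order ** 2 - 5 * order + 15
    return per_sample + per_frame / frame_len

# N = 2000, P = 12, L = 160, assumed lag range 20..160, 16 kHz sampling
flops = pem_afrow_flops_per_sample(2000, 12, 160, 20, 160)
flops_per_second = flops * 16000
```

Under these assumptions the result is roughly 17014 operations per sample, i.e. about $272 \cdot 10^6$ operations per second, in agreement with the figure quoted above.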

SIMULATION RESULTS
In Figure 3 the error norm $\|\hat{\mathbf{f}}(k) - \mathbf{f}(k)\|$ is plotted as a function of time. Note that only the identification performance is shown, which means that the cancellation filter is not inserted into the scheme during adaptation. The signal is a sentence uttered by a male voice, and the acoustic path has 1000 taps. We use NLMS for the adaptive filter, since in a practical implementation this would be the adaptive algorithm of choice (due to complexity constraints). Note that the performance of all algorithms is dependent on the energy ratio ('signal to noise ratio') of the loudspeaker component arriving at the microphone versus the source signal arriving at the microphone (the source signal should thus be interpreted as 'noise'). The simulations shown here were done for one specific situation where this ratio was -11 dB, but experiments show a similar performance difference between the algorithms for other ratios. The short term prediction frame length is 160 samples, the long term prediction frame overlap is 80 samples, and the minimum and maximum lag for the long term predictor are 20 and 160 respectively. The sampling frequency is 8 kHz. The speech model order in PEM-AFROW and PEM-AFC is 12. The forward delay is 200 taps in both PEM-AFROW and PEM-AFC (note that the PEM-AFC version of [8] does not explicitly incorporate a forward delay, but the theoretical analysis of [5] shows that this is required for correct performance, hence we added it to the system). We also show the performance of PEM-AFROW with the long term predictor disabled, because PEM-AFC does not use a long term predictor either.
The NLMS step size is 0.01 for PEM-AFROW, while PEM-AFC, which uses a modified NLMS algorithm and hence a different definition of the step size, was tuned to give the same initial convergence speed. This allows us to make a fair comparison of the resulting bias/variance of the solution. Direct identification (i.e. identifying the room impulse response as if the system were operating in open loop) is seen to give poor results. PEM-AFC performance decreases with path length, and for N = 1000, its behaviour is only slightly better than that of direct identification. PEM-AFROW performs well for long paths too. The poor performance of PEM-AFC is to be attributed to the stationary speech model assumption, which is not fulfilled for long paths.

CONCLUSION
We have introduced a new algorithm, referred to as PEM-AFROW, which allows for acoustic feedback cancellation in setups with long acoustic paths. It uses a speech source model with short and long term prediction. Not only the howling phenomenon is suppressed but also the reverberation-like sounds, which become audible in the marginal stability region. The main differences with existing schemes are that our algorithm incorporates a long term prediction filter which removes periodicity in the short term speech signal residual, and that we do not assume stationarity of the speech signal over the length of the data window on which the acoustic path is identified. PEM-AFROW hence performs very well for long acoustic paths, while it is even slightly better than the existing methods for short path applications. Thanks to the low complexity, the algorithm can easily be implemented in real time.

Figure 3: PEM-AFROW with and without long term prediction versus direct identification for long paths (1000 filter taps).