Experimental analysis of optimal window length for independent low-rank matrix analysis

In this paper, we address the blind source separation (BSS) problem and analyze the optimal window length in the short-time Fourier transform (STFT) for independent low-rank matrix analysis (ILRMA). ILRMA is a state-of-the-art BSS technique that utilizes the statistical independence between low-rank matrix spectrogram models, which are estimated by nonnegative matrix factorization. In conventional frequency-domain BSS, the modeling error of a mixing system increases when the window length is too short, and the accuracy of statistical estimation decreases when the window length is too long. Therefore, the optimal window length is determined by both the reverberation time and the number of time frames. However, unlike classical BSS methods such as ICA and IVA, ILRMA enables the full modeling of spectrograms, which may improve the robustness to a decrease in the number of frames in a longer-window case. To confirm this hypothesis, the optimal window length for ILRMA is experimentally investigated, and the difference between the performances of ILRMA and conventional BSS is discussed.


I. Introduction
Source separation is a technique for estimating specific source signals from observed mixture signals. Many approaches have been developed for single-channel and multichannel observations. Blind source separation (BSS) in determined and overdetermined cases (number of channels ≥ number of sources) has been well studied [1]-[10]. BSS does not require any prior information about the recording environment or the locations of sources or sensors. In particular, independent component analysis (ICA) [1] and its extensions, frequency-domain ICA (FDICA) [2]-[7] and independent vector analysis (IVA) [8]-[10], are the most popular methods for solving the BSS problem of audio signals. These methods exploit the statistical independence between specific sources and estimate a demixing matrix for the separation. For both ICA and IVA, fast and stable update rules, which are derived by an auxiliary function technique, have been proposed [11], [12].
As another means of solving audio source separation, nonnegative matrix factorization (NMF) [13], [14] is widely used for both blind and informed source separation [15]-[20]. NMF is a parts-based low-rank decomposition and can extract meaningful spectral patterns (bases) with their time-varying gains (activations) from an observed spectrogram. In [21] and [22], a multichannel extension of NMF (multichannel NMF: MNMF) was proposed, which clusters the decomposed bases and activations into each source using estimated spatial parameters.
Recently, a new BSS method that unifies IVA and NMF was proposed by the authors [23]-[25], which is called independent low-rank matrix analysis (ILRMA) in this paper. Similarly to MNMF, ILRMA exploits the NMF decomposition of the estimated source spectrograms as a low-rank spectral model and optimizes the frequency-wise demixing matrix based on the independence between the spectral models. This NMF-based spectral model in ILRMA (low-rank matrix) can be interpreted as a natural extension of those in FDICA (scalar) and IVA (vector).
The separation result of all ICA-based frequency-domain BSS methods strongly depends on the length of the analysis window in the short-time Fourier transform (STFT). This is because the modeling error of a mixing system increases when the window length is too short, and the accuracy of statistical estimation decreases when the window length is too long (fewer time frames) [4], [26]. However, unlike classical BSS methods such as ICA and IVA, ILRMA enables the full modeling of spectrograms, which may improve the robustness to a decrease in the number of frames in a longer-window case. In this paper, to confirm this hypothesis, we experimentally compare the optimal window lengths for FDICA, IVA, and ILRMA, and discuss the difference in their performances.

A. Formulation
Let $N$ and $M$ be the numbers of sources and channels, respectively. The complex-valued source, observed, and estimated signals are defined as $s_{ij} = (s_{ij,1}, \dots, s_{ij,N})^T$, $x_{ij} = (x_{ij,1}, \dots, x_{ij,M})^T$, and $y_{ij} = (y_{ij,1}, \dots, y_{ij,N})^T$, where $i = 1, \dots, I$; $j = 1, \dots, J$; $n = 1, \dots, N$; and $m = 1, \dots, M$ are the integral indexes of the frequency bins, time frames, sources, and channels, respectively, and $T$ denotes a transpose. We also describe the spectrograms of the source, observed, and estimated signals as $S_n \in \mathbb{C}^{I \times J}$, $X_m \in \mathbb{C}^{I \times J}$, and $Y_n \in \mathbb{C}^{I \times J}$, whose elements are $s_{ij,n}$, $x_{ij,m}$, and $y_{ij,n}$, respectively. In FDICA, IVA, and ILRMA, the following mixing system is assumed:
$$x_{ij} = A_i s_{ij}, \tag{1}$$
where $A_i = (a_{i,1}, \dots, a_{i,N}) \in \mathbb{C}^{M \times N}$ is a frequency-wise mixing matrix and $a_{i,n}$ is the steering vector for the $n$th source. This mixing system is called a linear time-invariant mixture or the rank-1 spatial model [27]. Thus, the estimated signal $y_{ij}$ can be obtained by assuming $M = N$ and estimating the frequency-wise demixing matrix $W_i = (w_{i,1}, \dots, w_{i,N})^H$ as
$$y_{ij} = W_i x_{ij}, \tag{2}$$
where $w_{i,n}$ is the demixing filter for the $n$th source and $H$ denotes a Hermitian transpose. The objective in FDICA, IVA, or ILRMA is to estimate both $W_i$ and $y_{ij}$ from only the observation $x_{ij}$, assuming the statistical independence between $s_{ij,n}$ and $s_{ij,n'}$, where $n' \neq n$.
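The mixing model (1) and the corresponding demixing step can be illustrated with a minimal NumPy sketch. The array shapes, random data, and the use of the exact inverse of the mixing matrix as the demixing matrix are assumptions made for illustration only; in BSS the demixing matrix must be estimated blindly.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, N = 4, 50, 2  # frequency bins, time frames, sources (= channels, M = N)

# Hypothetical complex source spectrograms s[i, j, n] and mixing matrices A[i]
s = rng.standard_normal((I, J, N)) + 1j * rng.standard_normal((I, J, N))
A = rng.standard_normal((I, N, N)) + 1j * rng.standard_normal((I, N, N))

# Mixing (1): x_ij = A_i s_ij for every frequency bin i and time frame j
x = np.einsum("imn,ijn->ijm", A, s)

# Demixing with the ideal (oracle) matrix W_i = A_i^{-1}: y_ij = W_i x_ij
W = np.linalg.inv(A)
y = np.einsum("inm,ijm->ijn", W, x)

# With the oracle W_i, the sources are recovered up to numerical precision
max_error = np.max(np.abs(y - s))
```

Because one matrix per frequency bin is applied to all time frames, the model is only valid when the reverberation is short relative to the analysis window.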

B. FDICA and IVA
In FDICA [2]-[7], a robust BSS method for reverberant observations, ICA is applied to the frequency-wise signal $(x_{i1,m}, \dots, x_{iJ,m})$ while assuming a non-Gaussian source distribution $p(s) \approx p(y)$. Since the permutation of the estimated signals at each frequency must be aligned, various permutation solvers have been proposed. IVA [8]-[10] is one of the most elegant solutions of the permutation problem. IVA formulates the frequency components as a vector $\bar{x}_{j,m} = (x_{1j,m}, \dots, x_{Ij,m})^T$ and applies multivariate ICA to the vector signal $(\bar{x}_{1,m}, \dots, \bar{x}_{J,m})$ to estimate the frequency-wise demixing matrix $W_i$, where the source vector $\bar{s}_{j,n} = (s_{1j,n}, \dots, s_{Ij,n})^T$ is assumed to have a spherical $I$-dimensional non-Gaussian source distribution $p(\bar{s})$ [10]. This spherical property ensures higher-order dependences among the frequency components in $\bar{s}_{j,n}$, thus avoiding the permutation problem.

C. ILRMA
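ILRMA extends the source model $p(\bar{s})$ in IVA to the following time-varying distribution:
$$p(Y_n) = \prod_{i,j} p(y_{ij,n}) = \prod_{i,j} \frac{1}{\pi r_{ij,n}} \exp\left( -\frac{|y_{ij,n}|^2}{r_{ij,n}} \right), \tag{3}$$
where the local distribution $p(y_{ij,n})$ is defined as a circularly symmetric (isotropic) complex Gaussian distribution, namely, the probability density $p(y_{ij,n})$ depends only on the power of the complex value $y_{ij,n}$. Also, $r_{ij,n}$ is a time-frequency-varying nonnegative variance and corresponds to the expectation of the power of $y_{ij,n}$, namely, $E[|y_{ij,n}|^2]$. This is because $p(y_{ij,n})$ is isotropic in the complex plane. Since the variance $r_{ij,n}$ can fluctuate depending on the time frames, (3) becomes a non-Gaussian distribution. The negative log-likelihood function $\mathcal{L}$ based on (3) can be obtained as follows by assuming the independence between each source and each time frame:
$$\mathcal{L} = \sum_{i,j,n} \left[ \frac{|y_{ij,n}|^2}{r_{ij,n}} + \log r_{ij,n} \right] - 2J \sum_i \log |\det W_i| + \mathrm{const.}, \tag{4}$$
where $y_{ij,n} = w_{i,n}^H x_{ij}$. ILRMA applies Itakura-Saito-divergence-based NMF (ISNMF) to $Y_n$. In ISNMF [28], the decomposition $y_{ij,n} = \sum_l c_{ij,nl}$ is assumed, where $l = 1, \dots, L$ is the integral index and $L$ is set to a much smaller value than $\min(I, J)$. The components $c_{ij,nl}$ are assumed to be mutually independent and obey
$$c_{ij,nl} \sim \mathcal{N}_c(0, t_{il,n} v_{lj,n}), \tag{5}$$
where $t_{il,n}$ and $v_{lj,n}$ are the basis and activation, respectively, and $\mathcal{N}_c(0, \sigma^2)$ denotes the zero-mean circularly symmetric complex Gaussian distribution with variance $\sigma^2$.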
Because of the reproductive property of (5), $y_{ij,n}$ $(= \sum_l c_{ij,nl})$ obeys (3) with the variance $r_{ij,n} = \sum_l t_{il,n} v_{lj,n}$. This fact means that the additivity of the power spectrogram holds in an expectation sense [28], which provides a justification for decomposing the power spectrogram. Therefore, the power spectrogram of the estimated source is approximately decomposed with a fixed number of bases and activations as $|Y_n|^{.2} \approx T_n V_n$, where the absolute value and the dotted exponent for a matrix denote an element-wise absolute value and exponent, respectively, and $T_n \in \mathbb{R}_{\geq 0}^{I \times L}$ and $V_n \in \mathbb{R}_{\geq 0}^{L \times J}$ are the basis and activation matrices for the $n$th source, respectively. The estimation of $W_i$, $T_n$, and $V_n$ can consistently be carried out by minimizing (4) in a fully blind manner. Note that ILRMA is theoretically equivalent to conventional MNMF only when the rank-1 spatial model is assumed, which yields a stable and computationally efficient algorithm for ILRMA. This issue and the convergence-guaranteed fast update rules for $W_i$, $T_n$, and $V_n$ can be found in [25].
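To make the roles of the demixing matrix and the low-rank variance model concrete, the following sketch evaluates the ILRMA negative log-likelihood for given parameters. It assumes the commonly used form of the cost, namely the sum over all bins, frames, and sources of |y|^2 / r + log r, minus 2J times the sum over frequency bins of log|det W_i|, with constants dropped. The function name `ilrma_cost` and the array layouts are hypothetical and chosen only for this illustration.

```python
import numpy as np

def ilrma_cost(X, W, T, V):
    """Negative log-likelihood of ILRMA up to an additive constant.

    X: observed spectrograms, shape (I, J, M), complex
    W: demixing matrices, shape (I, N, M); row n of W[i] is w_{i,n}^H
    T: basis matrices, shape (N, I, L), nonnegative
    V: activation matrices, shape (N, L, J), nonnegative
    """
    J = X.shape[1]
    # Estimated signals y_ij = W_i x_ij
    Y = np.einsum("inm,ijm->ijn", W, X)
    # Low-rank variance model r_ij,n = sum_l t_il,n v_lj,n
    R = np.einsum("nil,nlj->ijn", T, V)
    power = np.abs(Y) ** 2
    data_term = np.sum(power / R + np.log(R))
    # Log-determinant term, which prevents the trivial solution W_i = 0
    logdet_term = np.sum(np.log(np.abs(np.linalg.det(W))))
    return data_term - 2.0 * J * logdet_term
```

The data term is the Itakura-Saito-type fit between the power of the estimated signals and the NMF model, so the same cost couples the spatial parameters (W) and the spectral parameters (T, V).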

A. Motivation
In the practical use of frequency-domain BSS, the length of the analysis window in STFT directly affects the separation performance. For instance, a decrease in performance for shorter- or longer-window cases in FDICA was reported in [4]. When the window length is too short, the separation fails because the mixing assumption (1) does not hold owing to the reverberation. In contrast, when the window length is too long, the statistical estimation in ICA fails because the number of time frames $J$ decreases. IVA and ILRMA also suffer from this problem; in the extreme case $J = 1$, they obviously cannot estimate the demixing matrix $W_i$. However, the full modeling of the $I \times J$ spectrogram in ILRMA may improve the robustness to a decrease in the number of frames in a longer-window case (fewer time frames). In this section, we experimentally compare the optimal window lengths for FDICA, IVA, and ILRMA and discuss the difference in their performances.
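The trade-off between the window length and the number of time frames can be made concrete with a small calculation. The sketch below assumes a signal without padding, a 16 kHz sampling rate, and a shift of half the window length; these values and the function name `stft_dimensions` are illustrative assumptions and are not taken from the experimental conditions of this paper.

```python
def stft_dimensions(duration_s, window_s, fs=16000, overlap=0.5):
    """Return (I, J): number of frequency bins and time frames of an STFT.

    Assumes no zero padding at the signal edges and a one-sided spectrum.
    """
    n_samples = int(round(duration_s * fs))
    win = int(round(window_s * fs))
    hop = max(1, int(win * (1.0 - overlap)))
    n_bins = win // 2 + 1
    if n_samples < win:
        return n_bins, 1  # a single (zero-padded) frame
    n_frames = 1 + (n_samples - win) // hop
    return n_bins, n_frames

# A 10 s signal analyzed with a 64 ms window and with a 1024 ms window
short_window = stft_dimensions(10.0, 0.064)  # (513, 311)
long_window = stft_dimensions(10.0, 1.024)   # (8193, 18)
```

Under these assumptions, lengthening the window by a factor of 16 reduces the number of frames available for estimating each frequency-wise demixing matrix from 311 to 18, while the number of frequency bins grows correspondingly.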

B. Dataset and Experimental Conditions
In this experiment, we used four music and four speech observations, as shown in Table I, where each observation includes two sources. These dry sources were obtained from professionally produced music and underdetermined separation tasks in SiSEC2011 [29]. To simulate the reverberant mixture, the observed signals were produced by convolving each source with the impulse response E2A ($T_{60}$ = 300 ms) or JR2 ($T_{60}$ = 470 ms), which was obtained from the RWCP [30]. Fig. 2 shows the recording conditions of the impulse responses. Note that all the separation tasks are determined, namely, $N = M = 2$.
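The construction of a reverberant mixture by convolution can be sketched as follows. Since the measured RWCP impulse responses are not reproduced here, the sketch substitutes a toy impulse response (exponentially decaying white noise that reaches -60 dB at the given reverberation time); this substitution and the function names `synthetic_rir` and `mix` are assumptions for illustration only.

```python
import numpy as np

def synthetic_rir(t60_s, fs, rng):
    """Toy impulse response: white noise whose amplitude decays by 60 dB at t60."""
    n = int(round(t60_s * fs))
    t = np.arange(n) / fs
    decay = 10.0 ** (-3.0 * t / t60_s)  # amplitude factor 10^-3 (= -60 dB) at t = t60
    return rng.standard_normal(n) * decay

def mix(sources, rirs):
    """Convolve each dry source with its impulse response and sum per microphone.

    sources: list of N one-dimensional arrays (dry sources)
    rirs: rirs[m][n] is the impulse response from source n to microphone m
    Returns an array of shape (M, length) with the observed signals.
    """
    max_src = max(len(s) for s in sources)
    max_rir = max(len(h) for row in rirs for h in row)
    observed = np.zeros((len(rirs), max_src + max_rir - 1))
    for m, row in enumerate(rirs):
        for n, h in enumerate(row):
            image = np.convolve(sources[n], h)  # source image at microphone m
            observed[m, : len(image)] += image
    return observed
```

With two sources and a 2 x 2 grid of impulse responses, `mix` returns a two-channel observation, which corresponds to the determined case considered here.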
We compared three BSS methods, namely, FDICA, IVA, and ILRMA. For FDICA, two permutation solvers, one blind and one ideal, were employed and compared: FDICA+DOA and FDICA+IPS. FDICA+DOA solves the permutation problem by clustering the components using the relative locations of microphones and the estimated direction of arrival (DOA) [3], and FDICA+IPS utilizes the reference (oracle) source spectrograms $S_n$ to align the permutations, which is an ideal permutation solver (IPS). All the optimizations in FDICA, IVA, and ILRMA were based on an auxiliary function technique [11], [12], [25]. The other experimental conditions are described in Table II. As an evaluation score of the separation performance, we used the improvement of the signal-to-distortion ratio (SDR) [31].
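The notion of an SDR improvement can be illustrated with a simplified sketch. Note that the code below uses a plain waveform-difference SDR as a stand-in; it is not the full SDR definition of [31], which additionally decomposes the estimate into target, interference, and artifact components. The function names are hypothetical.

```python
import numpy as np

def simple_sdr_db(reference, estimate):
    """Simplified SDR in dB: reference power over the power of the residual."""
    residual = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))

def sdr_improvement_db(reference, mixture, estimate):
    """Improvement of the separated signal over the unprocessed mixture."""
    return simple_sdr_db(reference, estimate) - simple_sdr_db(reference, mixture)
```

Reporting the improvement rather than the raw SDR removes the dependence on how strongly each source was masked in the input mixture.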

C. Comparison Using Ideal Initialization
To compare the net separation ability for each setting of the window length, in this subsection, the initial values of the spatial and spectral parameters in each BSS method are set to their ideal values. For the spatial parameter, the initial demixing matrix $W_i^{(\mathrm{initial})}$ is set to its optimal value, which gives the best separation performance under the linear mixing assumption (1). In addition, only for ILRMA, sourcewise initial basis and activation matrices, $T_n^{(\mathrm{initial})}$ and $V_n^{(\mathrm{initial})}$, are pretrained by ISNMF using the oracle power spectrogram $|S_n|^{.2}$ as
$$\left( T_n^{(\mathrm{initial})}, V_n^{(\mathrm{initial})} \right) = \arg\min_{T_n, V_n} D_{\mathrm{IS}}\left( |S_n|^{.2} \,\|\, T_n V_n \right), \tag{7}$$
where $D_{\mathrm{IS}}(\cdot \| \cdot)$ is the element-wise Itakura-Saito divergence. Therefore, in this experiment, FDICA+DOA and IVA are based on the spatial oracle initialization, and FDICA+IPS and ILRMA are based on the spatial and spectral oracle initialization. Since the separation performance is obviously maximized for the initial parameter given by (6), this experiment illustrates how the performance decreases at the converged solution for the model used in each method.
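The ISNMF pretraining of the basis and activation matrices on an oracle power spectrogram can be sketched with standard multiplicative updates. The square-root form of the updates used below is a well-known variant that does not increase the Itakura-Saito divergence; it is shown as an assumption and is not necessarily identical to the update rules used by the authors (see [25]). The function names, the initialization range, and the iteration count are hypothetical.

```python
import numpy as np

def is_divergence(P, R):
    """Element-wise Itakura-Saito divergence summed over all entries."""
    ratio = P / R
    return np.sum(ratio - np.log(ratio) - 1.0)

def isnmf(P, L, n_iter=100, seed=0):
    """Fit P (I x J, strictly positive) by T @ V with L bases.

    Returns T (I x L) and V (L x J), both strictly positive.
    """
    rng = np.random.default_rng(seed)
    I, J = P.shape
    T = rng.uniform(0.1, 1.0, (I, L))
    V = rng.uniform(0.1, 1.0, (L, J))
    for _ in range(n_iter):
        R = T @ V
        # Multiplicative update of the bases (square-root variant)
        T *= np.sqrt(((P / R ** 2) @ V.T) / ((1.0 / R) @ V.T))
        R = T @ V
        # Multiplicative update of the activations (square-root variant)
        V *= np.sqrt((T.T @ (P / R ** 2)) / (T.T @ (1.0 / R)))
    return T, V
```

In the oracle setting, `P` would be the power spectrogram of a reference source, and the returned matrices would serve as the initial spectral parameters for that source.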
The results are shown in Figs. 3 and 4, where the scores are averaged over the observed signals with the same impulse response. As already mentioned in Sect. III-A, the separation with a shorter window is highly limited in all the methods because the assumption of a linear mixture model (1) collapses (the reverberation time exceeds the window length). For the longer-window case, the performance of FDICA and IVA deteriorates when the length exceeds $2T_{60}$, even if the oracle source spectrogram is employed in FDICA+IPS. This instability in the statistical estimation is caused by the insufficient number of time frames $J$ [4]. On the other hand, for the music signals (Fig. 3), ILRMA maintains its separation accuracy even for windows longer than 1 s. This is a benefit of employing the full modeling of time-frequency dependences, and the robustness to fewer time frames is improved by the low-rank spectrogram modeling. From this result, we can confirm that a longer window length exceeding $2T_{60}$ is preferable for music source separation using ILRMA, whereas FDICA achieves the highest performance when the length is set to less than $2T_{60}$. However, this behavior does not appear in the results for speech signals (Fig. 4). This is because the low-rank assumption in ILRMA does not apply to the speech signals, and the spectral model cannot capture the precise source spectrogram during the optimization.
Since the NMF parameters are pretrained using (7), an increase in the number of bases directly improves the accuracy of the spectral model $T_n V_n$ and the separation performance of ILRMA. This means that improving the precision of the spectral model will provide a better estimation of $W_i$, as predicted in [10].

D. Comparison Using Random Initialization
In this subsection, the separation performance in a practical situation is compared for various window lengths. The initial demixing matrix $W_i^{(\mathrm{initial})}$ was set to the identity matrix in all the methods, and the initial NMF matrices $T_n^{(\mathrm{initial})}$ and $V_n^{(\mathrm{initial})}$ were set to nonnegative uniform random values. Therefore, FDICA+DOA only utilizes the knowledge of the microphone spacing, FDICA+IPS still exploits $|S_n|^{.2}$ for IPS, and the other methods are fully blind.
The results are shown in Figs. 5 and 6. In this experiment, ILRMA cannot maintain its accuracy for longer windows, and the optimal length in ILRMA is almost the same as those in FDICA and IVA. This means that the blind estimation of a precise spectral model is a difficult problem, and the robustness of ILRMA against fewer time frames deteriorates as a result.
The number of bases $L$ does not strongly affect the performance in the music separation task (Fig. 5). For the speech signals (Fig. 6), as reported in [25], a small number of bases is preferable, although spectrograms of speech signals do not have the low-rank property. For speech signals, the estimation of $T_n V_n$ using a large number of bases always fails to capture the precise source spectrograms $|S_n|^{.2}$ because of the difficulty in optimization, and a rough and broad spectral model with a small number of bases can stably separate the speech sources.
Since FDICA+IPS achieves high separation accuracy even for speech signals, we have significant scope to improve speech BSS using the linear mixing model (1), which yields a computationally efficient solution. The blind capture of complicated (not low-rank) spectrograms requires another criterion, such as sparseness or time-varying speech structures, which can be considered as a further study.

IV. Conclusion
We presented an experimental analysis of optimal window lengths for FDICA, IVA, and ILRMA. Since ILRMA employs not only the independence between sources but also a time-frequency structure for the estimation of a demixing matrix, the robustness to long windows (fewer time frames) can be improved. However, in a practical situation, the optimal window length of ILRMA was similar to that in IVA or FDICA, which shows the difficulty of the blind estimation of a precise spectral model in ILRMA.

Fig. 1. Conceptual model of ILRMA, where $\tilde{x}_m$ and $\tilde{y}_n$ are time-domain signals of $X_m$ and $Y_n$, respectively.

Fig. 1 shows the conceptual model of ILRMA. When original sources have a low-rank spectrogram $|S_n|^{.2}$, the spectrogram of their mixture $|X_m|^{.2}$ should be more complicated, namely, the rank of $|X_m|^{.2}$ will be greater than that of $|S_n|^{.2}$. On the basis of this assumption, in ILRMA, the low-rank constraint for each estimated spectrogram $|Y_n|^{.2}$ is introduced by employing NMF. The demixing matrix $W_i$ is estimated so that the spectrogram of the estimated signal $|Y_n|^{.2}$ becomes a low-rank matrix modeled by $T_n V_n$, whose rank is at most $L$.

TABLE I. Music and speech sources obtained from SiSEC2011