Orthogonally constrained independent component extraction: Blind MPDR beamforming

We propose a novel technique for the extraction of one independent component from an instantaneous linear complex-valued mixture of signals. The mixing model is optimized in terms of the number of parameters that are necessary to simultaneously estimate one column of the mixing matrix and one row of the de-mixing matrix, which both correspond to the desired source. The desired source is assumed to have a non-Gaussian distribution, while the other sources are modeled, for simplicity, as Gaussian-distributed, although in applications the other sources can be arbitrary. We propose an algorithm that can be interpreted as a blind self-steering Minimum-Power Distortionless Response (MPDR) beamformer. The method is compared with the popular Natural Gradient algorithm for general Independent Component Analysis. Their performances are comparable but the proposed method has a lower computational complexity; in examples, it is about four times faster.


I. INTRODUCTION
Consider the following complex-valued instantaneous mixture model that describes a multi-channel mixing problem with one target signal within a frequency bin [1].The model reads

x(t) = as(t) + y(t)
or where the former is a vector-symbolic description while the latter describes a batch of data; t is a time or frame index which will be omitted for brevity; x = [x 1 , . . ., x d ] T represents d observed signals on sensors; s represents a target signal and a is the corresponding d × 1 steering vector; y = [y 1 , . . ., y d ] T represents noise signals that are independent of s.The upper-case bold letters such as X and Y will denote matrices whose columns contain concrete samples of x and y, respectively; let the number of samples be N where N d; s is the 1 × N row vector of samples of s.All signals are assumed to have zero mean.
In array processing literature [2], an optimum extractor of s is the Minimum Power Distortionless Response beamformer (MPDR) defined trough This work was supported by The Czech Science Foundation through Project No. 17-00902S and partly by California Community Foundation through Project No. DA- 15-114599.where the output signal is w H x. The solution of ( 2) is where ] is the covariance of x; E[•] stands for the expectation operator.The expression can be used in practice by replacing C x by its sample-based estimate C x = XX H /N .However, the steering vector a must be known.
Various models have been proposed to identify a from x, see, e.g.[3].Most of recent attempts come from Blind Source Separation (BSS) based on Independent Component Analysis (ICA) [4], [5].In ICA, it is assumed that (1) can be written as a determined mixture where A generalization to ICA is Multidimensional ICA (MICA) [6], which enables the components of u 2 to be dependent.Blind Source Extraction (BSE) is a class of methods that aims to extract one signal of specific properties (not necessarily an independent signal); see, e.g., [7].ICA, MICA and BSE are bases of further extensions as is Independent Vector Analysis (IVA) [8] or Joint BSS (JBSS) [9]; see also [10] for further extensions and perspectives.
Since the role of s can be interchanged with any independent component of y, a priori knowledge is needed in order to separate the desired source.Typically, information is given in the form of a rough estimate of a, e.g., based on an approximate direction-of-arrival (DOA) of the target source on the sensors [11].An ICA (MICA) algorithm can be then initialized so that it converges to the desired solution.
In this paper, we focus on this situation when an initial guess for the steering vector is given, and the nearest independent source should be found.ICA as well as MICA appear to be too excessive for this goal, because they also aim to separate y.Therefore, in Section II, we propose a novel mixing model where the separation of y is not the goal.We will refer to it as to Independent Component Extraction (ICE) 1 .
The rest of the paper is organized as follows.In Section III, an initialization-dependent algorithm for the extraction of a nonGaussian target source is derived based on maximum likelihood principle.While s is assumed to be nonGaussian, y is modeled as a mixture of correlated Gaussian signals.
The algorithm is faster than the popular Natural Gradient algorithm by Amari et al. [13], which is shown by simulations in Section IV.Section V concludes the paper.

II. MIXING MODEL PARAMETRIZATION A. Mixing and De-mixing Matrices
The model (1) involves the scaling ambiguity that the couple (s, a) can be replaced by (αs, α −1 a) for any α = 0. Without any loss of generality, we can cope with this problem by putting a 1 , the first element of a, equal to one; let us denote2 a = [1; g]. ( This is equivalent to considering s as the signal of the target source as it is received by the first sensor.In multi-microphone systems, the elements of g are, after this normalization, referred to as Relative Transfer Functions (RTFs); see [14].
The ICE model describes (1) as a determined mixture, so the subspace spanned by [s; y] must have dimension d.In other words, y is assumed to be coherent so that it can be fully separated from x. Hence, y can be expressed as Let the mixing and the de-mixing descriptions of ( 1) with square matrices be, respectively, where v = [s; z].The structure of the mixing matrix is therefore A = [a, Q].Let the de-mixing matrix be partitioned as W = [w H ; B].The role of w is the separating vector that extracts s from x.By contrast, B extracts z while it blocks out s, so its role is the same as that of "Blocking Matrix" in array processors [2], [14].This property is ensured whenever B is orthogonal to a, so we put B = [g, −I d−1 ], by which the definition of z is completed, i.e., z = By = Bx; we will refer to z as to the background signals.
Since w H x = s, the vector w represents a distortionless beamformer satisfying constraint By parameterizing the vector as w H = [β, h H ], it follows from ( 5) and ( 7) that β = 1 − h H g. Now, by stating that WA = I d , the concrete parametrization of the mixing and the de-mixing matrix, respectively, is and It should be emphasized that the ICE model is minimized for the problem of separating s from x.Although we do not care about the form in which y is separated from x, the extracted counterpart of y, the background signal z, is fully specified.The mixing model has only 2d − 2 parameters represented by the elements of g and h.
This is in contrast with MICA, which is highly ambiguous in that Q and z can be arbitrary but y = Qz.In ICA, the elements of z should be independent, which is less ambiguous as compared to MICA but still ambiguous and limiting as compared to ICE.

B. Orthogonal Constraint
Some ICA and MICA algorithms impose orthogonal constraint (OG) on the separated signals [15].The separated signal subspaces are constrained to be orthogonal to ensure their uncorrelatedness.The constraint reduces the number of ambiguous parameters in the mixing model.
Let W denote a de-mixing matrix estimate and be concrete samples of the estimated de-mixed signals.The OG requires that where To simplify the notation, we do not explicitly write the dependence of B, w, a, etc. on g and h.
Once OG is applied, there is a coupling between g and h, which reduces the number parameters from 2d − 2 to d − 1.
Let the free variable be g.It can be shown that g and h are coupled through By comparing ( 12) with (3), w OG can be interpreted as an approximate MPDR beamformer where C x is replaced by C x , and a is replaced by its current estimate given by [1; g].

III. PROPOSED ALGORITHM
In this paper, we will focus on ICE constrained by the OG.The task can be formulated as follows.
Definition 1.The goal of the orthogonally constrained ICE is to find a, or, equivalently g, such that s = w H OG X is as independent of its orthogonal complement Z = BX as possible.
There are several statistical models in ICA relying on one or more signals' properties such as nonGaussianity, nonstationarity or nonwhiteness [16].In this paper, we focus on the nonGaussianity-based model where all signals are modeled as i.i.d.sequences.We will constrain to a particular case where 1) the target signal s is drawn from a nonGaussian distribution, while 2) the background signals z are circularly-symmetric Gaussian.The latter assumption is not that restrictive as it might seem.In the ICE de-mixing model, z = Bx correspond to "mixed" and correlated signals (up to very special cases).Due to the Central Limit Theorem, mixed signals tend to be more Gaussian [17].Moreover, the multivariate Gaussian distribution is the simplest statistical model of correlated signals.

A. Maximum Likelihood Principle
Let p s (ξ 1 ) and p z (ξ 2 ) denote the probability density functions of s and z, respectively; ξ 1 and ξ 2 denote free variables of appropriate dimensions.Using the independence assumption, the joint pdf of s and z is where ξ = [ξ 1 ; ξ 2 ], and p z (ξ 2 ) ∼ CN (0, C z ).From ( 6) and ( 9), the joint pdf of x is For the measured signals X, the log-likelihood function, as a function of the model parameters g and h, reads where x i denotes the ith column of X, that is, the ith sample of x, and C is a constant independent of g and h.By applying the OG and replacing C z by C z = Z Z H /N , we introduce a contrast function for g as The gradient of ( 19) with respect to g H is where is the score function corresponding to p s (•).The last term in (20) can be simplified according to where e 1 denotes the first column of I d , and s = w H OG X.The fact that Z s H = 0 follows from the OG (11), and the last equation follows from (8).

B. Gradient Ascent Algorithm
In practice, the pdf of the target signal is not known, so, in ICA, the score function ( 21) is replaced by an appropriate nonlinearity φ(•).For ICE, an additional requirement is that φ satisfies that It could be easily verified that for φ = ψ s , this condition is satisfied.By inspecting (20) for N → +∞, it is seen that the gradient is zero when a is the true steering vector, so the optimum solution is the extreme of the contrast function.To preserve this key property even when φ = ψ s , the condition (25) must be fulfilled.Since (25) cannot be ensured without knowing p s , we propose to normalize the selected nonlinearity φ so that the sample-based moment where the scalar function φ is applied element-wise.Now, using (24) and ( 26) and the fact that E{λ a C −1 x a} = h, the gradient (20) Based on this, we propose the gradient ascent algorithm described in Algorithm 1, which will be referred to as OGICE.
The method starts from an initial guess g ini and performs small steps in the direction of the gradient (27).The length of the step is controlled through a positive constant μ.In each step, the nonlinearity φ is normalized to satisfy (26).The algorithm stops when the norm of the gradient is smaller than a selected tolerance.

C. OGICE vs. (Natural) Gradient-Based ICA
Consider a special case where C x = I d (which can be achieved by de-correlating the input data X as in Algorithm 2).Then, w OG = a/ a 2 and (27) can be written in the form This reveals an interesting similarity between OGICE and the method by Bell and Sejnowski [18] for the classical ICA formulation, whose popular improvement is known as the Natural Gradient algorithm [13], [19].The update rule of [18] for the (ICA) de-mixing matrix is After the conjugate transpose and by denoting W −1 = A, the update rule is Now, the right-hand side of (28) corresponds to any column of the right-hand side of (30).Nevertheless, there are principal differences between the related algorithms.In (30), the update of any row of W is influenced by the other rows (the estimation of one component influences that of the others), because A = W −1 .Next, (28) is used under the condition (26) and when the data are de-correlated, which is not required by the ICA algorithm using (30).

IV. EXPERIMENTAL RESULTS
In simulations, we compared OGICE with the popular Natural Gradient (NG) algorithm from [13].Both methods were initialized using the true steering vector perturbed by random errors taken from CN (0, σ 2 ).In NG, the initial demixing matrix consists of the first row equal to (12) while the other rows are selected as in (9).In both methods, two nonlinearities φ(•) were tested.The first is tanh(•), which was employed by many ICA methods [4].The second nonlinearity is sign(x) = x/|x|, which corresponds to the score function of the circularly-symmetric complex Laplacian distribution.To compare, we consider also the performance of the approximate MPDR beamformer (using sample-based covariance matrix) (12) where a is either equal to the true value (oracle) or to its perturbed value (initial).
The target signal was generated from the latter distribution with variance one.The (interference) signals were generated as circularly-symmetric Gaussian CN (0, I d−1 ).The elements of the mixing matrix A were generated independently from CN (1, 1/4) (the mean value is one to simulate similarly distributed signals over the sensors); we set (A) 11 = 1.
The signals were mixed according to X = AS.This corresponds to (1) where a is equal to the first row of A while the covariance of the noise signals y is C y = A 2 A H 2 , where A 2 consists of columns 2, . . ., d of A. In each trial, the signals were generated, mixed and separated by the algorithms.The separated signal was evaluated in terms of Signal-to-Interference Ratio (SIR).Fig. 2 shows the SIR averaged over 10 3 trials, respectively, for varying d, N , and σ when the other parameters were fixed to d = 10, N = 10 3 and σ = 0.1; μ = 0.01 was used in OGICE as well as in NG.
The influence of the nonlinearity φ(•) on SIR is significant.OGICE as well as NG yields better SIR when φ(•) corresponds to the true score function sign(•) of the target signal distribution.NG appears to be more sensitive to the choice of φ(•).While it achieves almost superior results with sign(•), it performs worst with tanh(•).The fact that the approximate MPDR with oracle steering vector can be outperformed by OGICE as well as by NG (with sign(•)) points to the appealing potential of the blind estimators.The dependence of SIR on N reveals that the accuracy of OGICE is limited when N → +∞.This is well known phenomenon caused by the orthogonal constraint [15], [20].NG does not apply the constraint, so its accuracy grows faster with N .On the other hand, the results for N < 10 3 slightly promote OGICE prior to NG.
The results for varying σ reveal the radius of the region of convergence where the initial value leads to the separation of the desired signal (without random permutation in case of NG).Interestingly, NG appears to be robust in this example with sign(•) but less robust than OGICE with tanh(•).
Last but not least, Fig. 1 shows the average computational time per one iteration as functions of d and N .In these examples, OGICE is approximately four times faster than NG (performed in Matlab on a PC with 4-core 2.7 GHz processor, 8 GB RAM).The algorithms need approximately the same number of iterations to converge.

V. CONCLUSIONS
The novel mixing model of ICE has been introduced, which contains the minimum number of parameters needed to handle the separation of one source of interest.A novel OG-constrained algorithm, OGICE, was proposed based on the mixing model.The method is simpler than NG as it does not aim to decompose the rest of the mixture into independent signals.
eliminates the first element of vector-valued argument, {•} is the real part of the argument, x 1i is the first element of x i , and ψ s (ξ 1 ) = − ∂ log p s (ξ 1 )