Statistical efficiency of structured cpd estimation applied to Wiener-Hammerstein modeling

The computation of a structured canonical polyadic de-composition (CPD) is useful to address several important modeling problems in real-world applications. In this paper, we consider the identification of a nonlinear system by means of a Wiener-Hammerstein model, assuming a high-order Volterra kernel of that system has been previously estimated. Such a kernel, viewed as a tensor, admits a CPD with banded circulant factors which comprise the model parameters. To estimate them, we formulate specialized estimators based on recently proposed algorithms for the computation of struc-tured CPDs. Then, considering the presence of additive white Gaussian noise, we derive a closed-form expression for the Cramer-Rao bound (CRB) associated with this estimation problem. Finally, we assess the statistical performance of the proposed estimators via Monte Carlo simulations, by comparing their mean-square error with the CRB.


INTRODUCTION
The canonical polyadic decomposition (CPD), which can be seen as one possible extension of the SVD to higher-order tensors [1], is by now a well-established mathematical tool utilized in many scientific disciplines [2].As it requires only mild assumptions for being essentially unique, the CPD provides means for blindly and jointly identifying the components of multilinear models, which arise in many real-world applications; see [1][2][3] for some examples.
In practice, the data tensor to be decomposed is always corrupted by noise.Therefore, the assessment of the statistical performance of CPD computation algorithms via comparison with the Cramér-Rao bound (CRB) [10] is of practical interest, since it can guide the choice for an appropriate algorithm in application domains.Furthermore, it can provide valuable information for the study and development of such algorithms.For the unstructured CPD, [11] has derived the associated CRB and presented an evaluation of the popular alternating least-squares (ALS) algorithm for tensors of orders three and four.Regarding the structured case, the CRB for the estimation of a complex CPD with a particular Vandermonde factor has been derived in [12], motivated by the problem of estimating the directions of arrival of multiple source signals.Also, [13] has provided a closed-form expression for the CRB associated with the estimation of a CPD having Hankel and/or Toeplitz factors.
This paper addresses the statistical evaluation of algorithms specialized in computing a CPD having banded circulant factors when applied to estimate the parameters of a Wiener-Hammerstein (WH) model, which is a well-known block-oriented model used for representing nonlinear dynamical systems [14].Because many systems of practical relevance can be (approximately) described by the WH model, the problem of identifying its parameters from a set of experimental data (i.e., measured input and output samples) is well-studied; see, e.g., [14] and references therein.One possible approach, as described in [5], consists in estimating the WH model parameters by computing the structured CPD of a kernel of its equivalent Volterra model.Here, we derive a closed-form expression for the CRB associated with this estimation problem, assuming the availability of a previously identified Volterra kernel corrupted by white Gaussian noise.Then, we formulate specialized estimation algorithms based on the CP Toeplitz (CPTOEP) and circulant-constrained ALS (CALS) methods proposed in [8,9] and evaluate their performance by comparing their mean-square error with the CRB through Monte Carlo simulations.
Notation.Scalars are denoted by lowercase letters, e.g.θ i or a ij , vectors by lowercase boldface, e.g.θ or a j , matrices by boldface capitals, e.g.B or A (p) , and higher order arrays by calligraphic letters, e.g.X.We use the superscripts T for transposition, † for pseudo-inverse, ⊠ and ⊙ denote the Kronecker and Khatri-Rao products, respectively, and ⊗ stands for the (tensor) outer product.The shorthand a ⊠p denotes a⊠. ..⊠a,where a appears p times; a ⊗p and A ⊙p are defined analogously.For our purposes, a tensor X of order P will be assimilated to its array of coordinates, which is indexed by P indices.Its entries will be denoted by x i1,...,iP .

WIENER-HAMMERSTEIN MODEL
IDENTIFICATION VIA CPD

Tensors and the CP decomposition
The polyadic decomposition of a P th-order tensor is defined by where a r is the r th column of A (p) ∈ R Ip×R .The minimal value of R such that X can be written as in ( 1) is called the rank of X, in which case we refer to the above decomposition as the CPD of X.Another way of expressing ( 1) is where I ∈ R R×•••×R is the P th-order identity tensor and × p denotes the p-mode product (see, e.g., [2, Sec.2.5]).

The WH model and its equivalent Volterra model
The structure of a discrete-time WH model is as depicted in Fig. 1.Basically, it consists of a cascade connection comprising a memoryless nonlinearity g(•) "sandwiched" by two linear systems, W (z) and H(z).Because of its structured form constituted by fundamental blocks, the WH model is said to belong to the class of the block-oriented models [14].
In this paper, we consider the time-invariant WH model constituted by a polynomial nonlinearity of the form g(x) = P p=1 g p x p and by the finite impulse response filters After some manipulation, this relation can be put in the equivalent Volterra model form with

CPD-based WH model identification
We now describe the WH model identification approach proposed in [5], which involves computing the CPD of a symmetric high-order Volterra kernel.We start by noting that, being a function of multiple discrete indices, any pth-order symmetric Volterra kernel k (p) of memory M can be uniquely identified with a pth-order symmetric tensor . Owing to its convolutive form involving separable terms, the kernel (3) can be identified with the tensor where S r [e r . . .e Lw+r−1 ], with e m denoting the mth canonical basis vector of R M , and w = [w 0 . . .w Lw−1 ] T .Expression ( 4) is a symmetric CPD that can also be written as with I denoting the identity tensor, Note that the choice of which factor is postmultiplied by diag(h) is irrelevant, due to the scaling indeterminacy.We thus conclude that the WH model (2) has equivalent symmetric Volterra kernels whose CPD are constituted by circulant factors C and a factor of the form g p C diag(h), which absorbs the scaling coefficients g p and h r .
As the factors in (5) contain the parameters of the WH model (2), the above observations suggest the following twostep procedure for its identification: (i) estimate k (p) from an available set of input/output samples, using some Volterra kernel identification method (as, e.g., [16]), and (ii) compute the structured CPD from the associated symmetric tensor X.Note that this requires choosing some p ≥ 3, since otherwise the model is not identifiable: for p = 1, it is a vector containing sums of products of coefficients g p , h r and w l ; for p = 2, we have a bilinear decomposition, which is only unique under restrictive assumptions (such as orthogonality).Henceforth, we assume that (i) has been accomplished and focus on the problem of estimating the parameters from the CPD of X.

Formulation of estimation problem
Let us consider that a pth-order tensor has been constructed from an estimated kernel k (p) as described in the previous section.In practice, it is evident that such a tensor satisfies Y = X + N, where N is an error tensor accounting for the inevitable uncertainties which arise in the data-driven kernel estimation procedure.
where Φ(h) is given by the term between brackets, in which Φ r = S ⊠p r , and f (w) = w ⊠p .Our problem can therefore be expressed as that of estimating the parameters g p , w and h of the WH model from observations which satisfy y = ΨΦ(h)f (w) + n.We assume that the random vector n has zero-mean i.i.d.components drawn from the Gaussian distribution with variance σ 2 .

Identifiability
Due to the inherent scaling indeterminacy of our model, its local identifiability is only guaranteed with further assumptions.To eliminate this indeterminacy, we assume h = 0 and w 0 = g p = 1, which is sufficient due to the model structure.Note that this entails no loss of generality, as the other coefficients w l and g q can be rescaled accordingly.Defining now w such that w = [1 wT ] T , we can write the parameter vector of the WH model as η = [ wT h T ] T ∈ R M .Global identifiability, on the other hand, is related to the uniqueness of the structured CPD.As the k-rank [2,3] of C equals R, uniqueness follows from Kruskal's condition [2, Sec.3.2] if h 0 = R (which implies that the k-rank of C diag(h) is 1 The ordering of the rows of Ψ is of no consequence for our purposes.R) and R ≥ 2. If h 0 < R, then the k-rank of C diag(h) equals zero; in this case, Kruskal's condition is only met for P = 4 if R ≥ 3 and for P ≥ 5 if R ≥ 2.

Parameter estimation algorithms
In this section, we briefly review two methods that can be used to estimate the parameters η of a model of the form (4).

Circulant-constrained ALS algorithm
The first method consists of a specialization of the wellknown alternating least squares (ALS) algorithm in which the factor matrices of the CPD are constrained as in (4).In the case of a CPD involving only circulant factors, such strategy has already been followed in [9], leading to the CALS algorithm.Here, we adapt that algorithm for our purposes.
Initially, we define E l [e l . . .e R+l−1 ] ∈ R M×R , for l ∈ {1, . . ., L w }, and E [vec (E 1 ) . . .vec (E Lw )].With these definitions, we have vec(C) = Ew.Next, we note that any flat matrix unfolding of Y can then be written as where the above approximation is due to the presence of noise and the operator unvec R is defined such that, for every vec- Vectorizing both sides and using the property vec(A diag(b)D) = (D T ⊙ A)b, we have also vec(Y) ≈ (C ⊙p ) h.Hence, given current estimates ŵk and ĥk , we can update them with the scheme where Ĉk = [S 1 ŵk . . .S R ŵk ] and Ŵk = ( Ĉk ) ⊙p−1 .Note that, to derive these update equations, we have used the fact that As stopping criteria, one can check if either the relative difference between two consecutive values of the reconstruction error falls below some fixed threshold ǫ Y > 0 or a maximum number of iterations K max is attained.

CPTOEP algorithm
Since the objective is multimodal, the main goal is to find a good approximation of the solution by using a low-complexity algorithm.In [8], non-iterative procedures have been proposed, which are able to compute the exact CPD when matrix factors are banded or structured.Consider a matrix unfolding of Y under the form: Ỹ ≈ (C (1) ⊙C (2) )A T , where the structure of A is ignored, and where C (n) are assumed Toeplitz circulant of same size M × R, that is, they can each be expressed in the orthonormal basis {E ℓ , 1 ≤ ℓ ≤ L w } defined in Section 3.3.1: Let Ỹ = UΣV T denote the SVD of Ỹ.Then there exists a matrix N such that UN = (C (1) ⊙ C (2) ) and N −1 ΣV T = A T .Following the lines of [8], one can find matrix N and coefficients j by solving a linear system of M2 R equations in L 2 w + R 2 − 1 unknowns.If there are more equations than unknowns and if the system has full rank R, the solution (N, Z) is unique.First, coefficients c (1) i and c (2) j are obtained from the best rank-1 approximation of matrix Z, which eventually yields estimates Ĉ(1) and Ĉ(2) .Next, we calculate Ĉ = ( Ĉ(1) + Ĉ(2) )/2, and the estimate of h is obtained as in stage (iii) of the CALS algorithm.
The algorithm described above is suboptimal for several reasons: (a) the model is noisy, (b) the 3 factor matrices are assumed to be independent, whereas they are not, and (c) the structure of the 3rd matrix, A, is ignored.Hence the solution obtained will be inaccurate, but can be easily refined by a quasi-Newton algorithm, as will be subsequently shown.

Closed-form expression for the CRB
If we assume that η contains parameters associated with a system of interest, we have that the (vectorized) measured kernel satisfies y ∼ N (x, σ 2 I I ), where σ 2 denotes the variance of the elements of n.Hence, the mean-square error (MSE) of any locally unbiased estimator η(y) satisfies , where the CRB matrix B(η) can be computed by applying the Slepian-Bangs formula, which yields [13] where J(η) ∈ R I×M is the Jacobian matrix given by From (6) and the definition of f , we have in which z l ( w) = p q=1 w ⊠q−1 ⊠ e l+1 ⊠ w ⊠p−q (with the convention w ⊠0 = 1).To derive J(h), we first apply the property vec(ABD) = (D T ⊠ A)vec(B) to write In order to identify the contribution of w and h in CRB( wk ) and CRB(h r ), we propose to extend the results presented in [13] by using oblique projection.This is the purpose of the following proposition.Denote E AB the oblique projection whose range is A and whose null space contains B (see [17] for details).

Proposition 3.1
The closed-form expression for the CRB of wk is given by: where g k is the kth column of J( w) and G k is the submatrix of J( w) obtained by removing its kth column.Similarly, the closed-form expression for the CRB of h r is: where d r is the rth column of J(h) and D r is the submatrix of J(h) obtained by removing its rth column.
The proof is omitted due to the lack of space.

SIMULATION RESULTS
To illustrate the utility of the derived CRB, we now present some Monte Carlo simulation results.Specifically, we evaluate several estimators when applied to identify a WH model with parameters w T = [1 0.538 1.834 -2.259 0.862] T , h = [1.594-6.538 -2.168] T from estimates of the equivalent symmetric third-order kernel X, proceeding as follows.
For each realization of the (symmetric) noise tensor N, we vary σ 2 and then construct a data tensor Y = X + N for each chosen level of σ 2 .Next, we compute estimates η(y) given by: (i) the family of estimators N -CALS, which consist in applying N times the algorithm of Section 3.3.1 with random initializations and keeping the best solution in terms of reconstruction error (w.r.t.Y); (ii) the estimator CPTOEP, described in Section 3.3.2;(iii) the estimator CPTOEP-CALS, which corresponds to refining the CPTOEP estimate by applying the CALS algorithm; (iv) the estimator CPTOEP-BFGS, in which a similar refinement is obtained by minimizing a least-squares criterion (w.r.t.Y) with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm 2 [18].The maximum  1, as well as the computed values of the CRB.One can see that 1-CALS has a very poor performance, due to its frequent premature termination or inability to converge.Although 5-CALS performs better, its results are degraded for the same reasons.CPTOEP, in its turn, performs slightly worse than 10-CALS, but attains a similar level when refined by CALS.Yet, there remains a gap between their MSE curves and that of the CRB.Indeed, only the CPTOEP-BGFS estimator attains an MSE close to the CRB.Note that a similar gap has been reported by [11] for the ALS algorithm.Along the lines of their discussion, we believe that, in the case of CALS, this gap is due to the convergence problems which are always observed in practice, at least for a few runs.As for CPTOEP, this seems to happen because the adapted procedure yields suboptimal estimates.
Finally, we note that the above comparison is justified since, under the assumption of Gaussian additive noise, the least-squares criterion leads to the maximum likelihood (ML) estimator.In signal-in-noise problems, the ML estimator is often approximately unbiased even for a small sample size, provided that the SNR is sufficiently high [10].

CONCLUSION
A closed-form expression of the CRB has been derived for the parameter estimates of a CPD having banded circulant factors.Then, two specialized algorithms have been proposed to compute the CPD of a symmetric tensor whose factors are circulant Toeplitz.The first is an adaptation of the ALS method taking the structural constraints into account, whereas the second is composed of two steps: (i) compute an approximate solution thanks to a non iterative algorithm (CPTOEP), and (ii) refine the solution by a quasi-Newton descent (BFGS).The latter (CPTOEP-BFGS) reached the Cramér-Rao bound over a wide range of SNR values.The proposed algorithms have been applied to identify WH models, and their statistical performance has been evaluated using the derived CRB.
Furthermore, since k (p) (m 1 , . . ., m p ) is symmetric in m 1 , . . ., m p , in practice one estimates only the elements whose indices pertain to a suitable non-redundant domain such as D = {(m 1 , . . ., m p ) : m 1 ≤ • • • ≤ m p }, determining the others by symmetry.Hence, Y and N are also pth-order symmetric tensors, containing redundant elements.Introducing the selection matrix Ψ ∈ R I×M p , where I = |D| = M+p−1 p , which contains as rows 1 every product of the form e T mp ⊠ . . .⊠ e T m1 for (m 1 , . . ., m p ) ∈ D, we can write the (non-redundant) vectorized model y Ψvec(Y) = x + n ∈ R I , where x = Ψvec(X) and n = Ψvec(N) is a random vector.Now, from (4), we can deduce vec

Table 1 .
Simulation results: estimated MSE η values (in dB).We choose ǫ Y = 10 −10 and set the tolerance of BFGS also to 10 −10 .For each estimate η(y), we compute ε η = η − η(y) 2 .This procedure is repeated for 100 realizations of N and then ε η is averaged for each level of σ , yielding a mean-square error estimate denoted by MSE η .The results are shown in Table