Efficient algorithms for ‘universally’ constrained matrix and tensor factorization

We propose a general algorithmic framework for constrained matrix and tensor factorization, which is widely used in unsupervised learning. The new framework is a hybrid between alternating optimization (AO) and the alternating direction method of multipliers (ADMM): each matrix factor is updated in turn, using ADMM. This combination can naturally accommodate a great variety of constraints on the factor matrices, hence the term `universal'. Computation caching and warm start strategies are used to ensure that each update is evaluated efficiently, while the outer AO framework guarantees that the algorithm converges monotonically. Simulations on synthetic data show significantly improved performance relative to state-of-the-art algorithms.


INTRODUCTION
Constrained matrix and tensor factorization techniques are widely used for latent parameter estimation and blind source separation in signal processing, dimensionality reduction and clustering in machine learning, and numerous other applications in diverse disciplines, such as chemistry and psychology. Unconstrained rank factorization of matrices and tensors is relatively well-studied, as in the matrix case the basis of any solution is simply the principal components of the singular value decomposition (SVD), and in the tensor case alternating least squares (ALS) and other algorithms usually yield satisfactory results. However, the available algorithms for constrained matrix and tensor factorization leave much to be desired as of this writing, and a unified framework that can easily and naturally incorporate multiple constraints on the latent factors is sorely missing. Existing algorithms are usually only able to handle one or at most few specialized constraints, and/or the algorithm needs to be redesigned carefully if new constraints are added.
In this paper, we propose a general algorithmic framework that seamlessly and relatively effortlessly incorporates many common types of constraints and compositions thereof, building upon and bridging together the alternating optimization (AO) framework and the alternating direction method of multipliers (ADMM). While the AO framework provides monotone convergence for these (NP-)hard factorization problems, ADMM serves as a sub-routine that handles almost 'universal' constraints very efficiently, with per-iteration complexity similar to the unconstrained LS algorithm or sub-optimal approaches like the multiplicative non-negative matrix factorization (NMF) update.

Notation
We denote the (approximate) factorization of matrix Y ≈ WH T , where Y is m × n, W is m × k, and H is n × k, k ≤ m, n, and in most cases much smaller. Note that adding constraints on W and H may turn the solution from easy to find (via SVD) but non-identifiable, to NP-hard to find but identifiable. It has been shown that simple constraints like non-negativity and sparsity can make the factors essentially unique, but at the same time, computing the optimal solution becomes NP-hard -see [1] and references therein.
For simplicity, we only consider 3-way tensors in this paper, which are denoted with an under-score Y, of size n 1 × n 2 × n 3 . In what follows, we focus on the so-called parallel factor analysis (PARAFAC) model, also known as canonical decomposition (CANDECOMP) or canonical polyadic decomposition (CPD), which is essentially unique under mild conditions [2], but constraints certainly help enhance estimation performance, and even identifiability. One compact way to represent the CPD model is by writing it in matricized form as Y T (1) ≈ (C B)A T , where A, B, and C are the three factors sharing the same number of columns k, Y (1) is the mode-1 matricization of Y, and denotes the Khatri-Rao product. Details of these tensor related operations can be found in tutorial papers like [3], and are omitted here.

Alternating direction method of multipliers
The ADMM solves convex problems that can be put in the following form where u is a scaled version of the dual variable corresponding to the equality constraint Ax + Bz = c, and ρ is specified by the user.
A comprehensive review of the ADMM algorithm can be found in [4] and the references therein. ADMM converges under very mild conditions (f and g being closed, proper, and convex, and that an optimal solution exists). Notice that smoothness is not required, which means inequality constraints can be incorporated into the cost function by setting the function value to be +∞ if the constraints are not satisfied. Moreover, the splitting of x and z can be carefully chosen so that each of the updates is trivial (possibly in closed-form).

CONSTRAINED MATRIX FACTORIZATION
We start with an algorithm for constrained matrix factorization. We adopt the common matrix factorization setting by formulating it as the optimization problem where r W (·) and r H (·) are regularizations imposed on the latent factors W and H, and can take value +∞, so that any hard constraints can also be included. The problem (2) is not convex in both W and H, but is convex in one if we fix the other, thus alternating optimization is usually employed. We first describe how to efficiently update H while fixing W using ADMM, and then treat this as a sub-routine to solve (2) efficiently.

ADMM for regularized least-squares
We first reformulate the subproblem for the update of H as It is easy to adopt the ADMM algorithm and derive the following iterates: One important observation is that, throughout the iterations the same matrix W T Y and matrix inversion (W T W+ρI) −1 are used. Therefore, to save computations, we can cache W T Y and the Cholesky decomposition of W T W + ρI = LL T . Then the update ofH is dominated by one forward substitution and one backward substitution, resulting in a complexity of O(k 2 n).
The update of H is the so-called proximal operator of the function (1/ρ)r H (·) around point (H T −U), and in particular if r H (·) is an indicator function of a convex set, then the update of H becomes a projection operator, a special case of the proximal operator. For a lot of regularizations/constraints, especially those that are often used in matrix factorization problems, the update of H boils down to element-wise updates, i.e., costing O(kn) flops. Many efficiently implementable proximal operators can be found in [5, §6]. Here we list some of the most commonly used constraints/regularizations in the matrix factorization problem. For simplicity of notation, let us defineH =H T − U.
• Non-negativity. In this case r H (·) is the indicator function of R + , and the update is simply zeroing out the negative values ofH. In fact, any element-wise bound constraints can be handled similarly, since element-wise projection is trivial.
• Lasso regularization. For r H (H) = λ H 1 , the sparsity inducing regularization, the update is the well-known softthresholding operator: The element-wise thresholding can also be converted to blockwise thresholding if we know a priori that the zeros are grouped, leading to the group sparsity regularization.
• Simplex constraint. In some probabilistic model analysis we need to constrain the columns or rows to be elementwise non-negative and sum up to one. As described in [6], this projection can be done with a randomized algorithm with linear-time complexity on average.
We found empirically that by setting ρ = W 2 F /k, the ADMM iterates for the regularized least-squares problem converge very fast. With a good initialization (which will be discussed later), the update of H usually does not take more than 5 or 10 ADMM iterations. The proposed algorithm for the sub-problem (3) is summarized in Alg. 1. As we can see, the pre-calculation step takes O(k 2 m + k 3 ) flops to perform the Cholesky decomposition, and O(mnk) flops to form F. In the iterations the complexity is dominated by thẽ H-update, with complexity O(k 2 n). Notice that for the unconstrained problem, the complexity is essentially the same as the pre-calculation step plus one iteration. For a small number of ADMM iterations, the complexity of Alg. 1 is of the same order as the unconstrained problem.
9 until convergence; 10 return H and U.

Alternating optimization
Now, naturally, Alg. 1 is used as the sub-routine for alternating optimization. Since W and H are updated alternatingly, one would expect that the W and H (and their corresponding dual variables) obtained in the previous iteration should not be very far away from the updates for the current iteration. Thus, we can initialize the current ADMM update using the previous results. Experiments show that this can greatly reduce the number of inner-iterations required for ADMM. Soon after an initial transient stage, the sub-problems can be solved in just one iteration. The proposed algorithm for constrained matrix factorization is summarized in Alg. 2.
Algorithm 2: Proposed algorithm for (2) The convergence of Alg. 2 follows from that of the general block coordinate descent algorithm framework, i.e., the objective is monotonically decreasing, and every limit point is a stationary point, if there are only two blocks [7]. If we limit the maximum number of ADMM inner-loops, the updates at the initial phase will not be guaranteed to be conditionally optimal -but this can be treated as a generalized initialization step. By using the proposed warm-start strategy, the ADMM inner-loops soon become conditionally optimal, thus ensuring convergence of the outer-loop.

EXTENSION TO TENSOR FACTORIZATION
Going from two-way to higher-way data arrays is easy when using our approach: we can again adopt AO over the matrix factors, and solve each of the matrix sub-problems using Alg. 1. Without loss of generality, let us consider the update of A, which is arg min H of (3) with W = C B, the Khatri-Rao product of C and B, and Y = Y T (1) , the mode-1 matricization of the data tensor Y.
Due to the Khatri-Rao structure of W, W T W can be computed efficiently as C T C * B T B, where * is the element-wise (Hadamard) product. How to efficiently calcu- (1) depends on how the data tensor Y is stored, and various efficient implementations have been proposed [8][9][10]. For simplicity, we only consider the dense case, for which a computationally (but not memory-) efficient way is to replicate the tensor in three matricizations Y (1) , Y (2) , and Y (3) , so that in each iteration they are readily available. Beyond the computation of these special products, the rest of the computations involved in the ADMM inner-loops are all very cheap, as we have explained in the matrix case. Detailed algorithm description in the tensor case is omitted due to space limitations, and will be given in the journal version.
As we can see, the ADMM is an appealing sub-routine for alternating optimization, leading to a simple plug-andplay generalization of the workhorse ALS algorithm. Theoretically, they share the same per-iteration complexity if the number of inner ADMM iterations is small, which is true in practice, after an initial transient. Efficient implementation of the overall algorithm should include data-structurespecific algorithms for (C B) T Y T (1) , and may include parallel/distributed computation along the lines of [11]. Related developments will be included in the journal version.

RELATED WORK
The most common constraint imposed on the latent factors is non-negativity. Traditional methods used for the alternating sub-problems are active-set (AS) [12] and its variations like block principle pivoting [13,14]. The main idea is that the problem becomes unconstrained if we figure out the variables that should be equal to zero. However, even if we know exactly where the zeros should be (which is approximately so in the alternating optimization framework), the least-squares problem for each row of H is different. One can form W T W and W T Y once, but for each row of H different subsets of entries in W T W should be inverted. This becomes unappealing when k grows large. Some other methods, like the multiplicative-update (MU) [15] or hierarchical alternating least squares (HALS) [16], ensure that the per-iteration complexity is dominated by calculating W T W and W T Y, although more outer-loops are needed for convergence. To move beyond non-negativity, MU is not applicable, AS is di-rectly amendable to sparsity regularization only, and HALS is able to handle all constraints imposed on the columns of H [17], but not the rows.
The ADMM algorithm for constrained/regularized linear least-squares problems is widely used in large scale compressive sensing and image processing [18,19] applications, thanks to its nice convergence properties and the ability of Cholesky caching. However, to the best of our knowledge, this is the first time that it is used as a sub-routine for constrained matrix/tensor factorization. ADMM is a perfect fit for this kind of problem, because essentially each subproblem is a large number of constrained least-squares sharing the same mixing matrix, thus one matrix inversion is not only amortized throughout the inner iterations but also across different rows (as opposed to, say, what happens with the active-set method), resulting in per-iteration complexity that is almost the same as the unconstrained one. Furthermore, the alternating optimization framework naturally provides good initializations, which greatly reduces the number of inner-iterations required.
We should emphasize that our proposed algorithm is in general alternating optimization, and ADMM is only applied to the sub-problems. There are also algorithms that directly apply the ADMM approach to the whole problem [11,20]. In that case, the per-iteration complexity is also the same as the unconstrained alternating least-squares. However, due to the non-convexity of the problem, the objective is not guaranteed to decrease monotonically, unlike alternating optimization. Moreover, both ADMM and AO guarantee that every limit point is a stationary point, but in practice AO almost always converges, which is not the case for ADMM (applied to the whole problem).

NUMERICAL EXAMPLES
In this section we provide some illustrative examples to showcase the numerical performance of our proposed algorithm on synthetic data, compared to some of the existing algorithms mentioned in the previous section. All experiments are performed in MATLAB 2013a on a Linux server with 32 Xeon 2.00GHz cores and 128GB memory.

Non-negative matrix factorization
Here we compare the proposed algorithm with some state-ofthe-art NMF algorithms on synthetic data. By specifying the values of m, n, and k, the true W and H are generated by drawing the elements from an i.i.d. exponential distribution with mean 1, and then 50% of the elements are randomly set to 0. The data matrix Y is then set to be their product Y = WH T + N, where the elements of N are drawn from an i.i.d. Gaussian distribution with variance σ 2 = 10 −2 .
Three algorithms are used for comparison: HALS Hierarchical alternating least-squares [16];  AO-BPP Alternating optimization using block principle pivoting [13] 1 ; since these two algorithms are reported in [13] to out-perform other NMF algorithms, and ADMM ADMM applied to the whole problem [20] 2 . The proposed algorithm is denoted as AO-ADMM. All algorithms are initialized with the same random point.
The convergence time in seconds versus the relative error Y − WH T F / Y F of the aforementioned algorithms for two indicative problem instances is shown in Fig. 1. On the left, the problem size is m = n = 2000 and k = 100. As we can see, our proposed algorithm converges with the fastest rate. On the right, the problem size is increased to m = n = 5000, and k = 300, and the gap becomes larger, indicating that our proposed algorithm has better scalability, thus it is more suitable for big data. The first case is also repeated 10 times with random synthetic data, and the averaged result is given in Table 1.

Non-negative PARAFAC
We now test the algorithms in the case of tensors. For different values of n 1 , n 2 , n 3 , and k, the true latent factors A, B, and C are generated in the same manner as W and H in the NMF case. The data tensor Y is then generated as a low-rank PARAFAC model with additive noise Again, our proposed algorithm, denoted as AO-ADMM, is compared with HALS [16] 1 , AO-BPP [14] 1 , and ADMM [11], and the convergence time in seconds versus the relative error in indicative problem instances is shown in Fig. 2.   All algorithms are initialized with the same random point 3 .
On the left, n 1 = n 2 = n 3 = 500 and k = 100, while a somewhat unbalanced case is shown on the right with n 1 = 1000, n 2 = 500, n 3 = 200 and k = 100. The first case is also repeated 10 times with random synthetic data, and the averaged result is given in Table 2. For this experiment we have also included Tensorlab [21], which handles non-negative PARAFAC using "all-at-once" derivative-based nonlinear least-squares optimization techniques. As we can see, AO-ADMM again outperforms all other algorithms in all cases.

CONCLUSIONS AND FUTURE WORK
In this paper we proposed a general algorithmic framework for matrix and tensor factorization that aims to approximate the data matrix/tensor in the least-squares sense, with the ability to efficiently handle almost 'universal' constraints/regularizations imposed on the latent factors. The algorithm builds upon the widely used alternating optimization framework, but the constrained sub-problems are solved via the alternating direction method of multipliers. Simulations on synthetic data, with non-negativity constraints imposed, showed great performance compared to the state-of-the-art algorithms. More simulations of this algorithm with more constraints imposed and applied to real data will be given in the journal version. We are also working on applying this algorithmic framework for factorization problems where the error measure is different from the least-squares criterion, e.g., the l 1 loss, the Kullback-Leibler divergence, and/or loss evaluated only at a 3 However, ADMM is reported to be more sensitive to 'good' initializations [11], thus ADMM is probable to be restarted at a different initialization. subset of entries in Y when there are missing values. The periteration complexity is still similar to the unconstrained ALS algorithm, again by exploiting Cholesky caching and warm start. The goal is to provide a simple algorithmic framework for constrained matrix and tensor factorization which maintains the cheap per-iteration complexity and monotone reduction of the cost function, while providing flexibility in terms of constraints and faster convergence.