Multiple Subspaces Separation in Case of Camera Motion

We explore the problem of subspace clustering. Given a set of data samples approximately drawn from a union of multiple subspaces, our goal is to cluster the samples into their respective subspaces and to remove possible outliers. We propose an Approximated Robust PCA Clustering (ARPCAC) method that extracts the point trajectories induced solely by object motion from the pool of trajectories induced by both object and camera motion, and then projects them onto a 5-dimensional space using PowerFactorization. Our algorithm can be used to segment multiple motions in video and, furthermore, extends to the problem of face clustering. Experiments demonstrate state-of-the-art performance.


Introduction
Several types of visual data, such as motion, face, and texture, are known to be well characterized by subspaces. Recently, there has been increasing interest in geometrical and statistical models for understanding dynamic scenes in which both the camera and multiple objects move. The widely used Principal Component Analysis (PCA) method and the recently established matrix completion and recovery methods are essentially based on the hypothesis that the data are approximately drawn from a low-rank subspace. However, a given dataset can seldom be well described by a single subspace. A more reasonable model is to consider the data as lying near several subspaces. When the data is clean, i.e., the samples are strictly drawn from the subspaces, several existing methods (e.g., [3,5,17]) can exactly solve the subspace segmentation problem. The main challenge of subspace segmentation is therefore to handle the errors (e.g., noise and corruptions) that may exist in the data, i.e., to handle data that do not strictly follow subspace structures. With this outlook, we study the following robust subspace clustering problem: given a set of data samples approximately drawn from a union of linear subspaces, correct the possible errors, segment all samples into their respective subspaces, and simultaneously reveal each subspace's independent motion. By independent motion we mean motion from which the camera-induced component has been subtracted, so that the true trajectory of each object is revealed. Two main applications of this problem, motion segmentation and face clustering, are studied in this paper.
We propose a novel method termed ARPCAC (Approximated Robust Principal Component Analysis Clustering). Given a set of data samples, each of which can be represented as a combination of a low-rank component and a sparse component, ARPCAC aims at finding the lowest-rank representation of all data jointly, while simultaneously revealing the independent motion of each subspace. The computational procedure of ARPCAC solves a Frobenius- and ℓ2,1-norm regularized optimization problem, which is convex and can be solved in polynomial time. It can be shown that ARPCAC solves the subspace clustering problem well: the subspace membership is provably determined by belonging to either the low-rank, the sparse, or the error pattern, and hence ARPCAC can perform robust subspace clustering and error correction efficiently. Motion segmentation from multiple views has been studied in the case of affine cameras, because in this case the motion of each rigidly moving object lives in a four-dimensional subspace [5]. In this paper, however, we do not need to assume an affine camera model, since the camera motion is compensated for by the dominant subspace, which is reasonably close to the background motion in most practical applications.

Related Work
Mixtures of Gaussians have been used in [12], where a maximum likelihood estimate was computed, and in [10], where Random Sample Consensus (RANSAC) was adopted. These methods are sensitive to errors, and this problem remains open due to the difficulty of the underlying optimization. Factorization-based methods [3] seek to approximate the given data matrix as a product of two matrices such that the support pattern of one of the factors reveals the segmentation of the samples. Generalized Principal Component Analysis (GPCA) [18] presents an algebraic way to model data drawn from a union of multiple subspaces; however, this method is sensitive to noise due to the difficulty of estimating the polynomials from real data. Subspace segmentation has also been treated as a clustering problem, where an affinity matrix is learned and the final segmentation is obtained by spectral clustering (SC); examples include Sparse Subspace Clustering (SSC) [5], LRR [17], and the proposed ARPCAC method. The main difference among these methods is how the affinity matrix is learned.
The ARPCAC Method

In this section, we present the ARPCAC method for recovering a matrix from corrupted and incomplete observations. Let D be a collection of data samples in the presence of outliers and corruptions. That is, for the set of points X_p ∈ P^3 observed in frames f ∈ F, we stack all the image measurements into a 2F × P matrix D. In order to recover the low-rank matrix L from the observation matrix D corrupted by errors E, it is straightforward to consider the following regularized rank minimization [6]:

min_{L, S, G, τ}  (1/2) ||G||_F^2 + λ ||S||_{2,1}   s.t.   D ∘ τ = L + S + G,   rank(L) ≤ r,   (2)

where rank(L) ≤ r ≪ rank(D), λ > 0 is a parameter, ||·||_F is the Frobenius norm, and ||·||_{2,1} is the ℓ2,1-norm, i.e., the ℓ1-norm of the vector formed by taking the ℓ2-norm of each matrix column, which promotes group sparsity. The error pattern E = S + G is the combination of a sparse pattern S containing the underlying subspaces and a noise pattern G that contains the noise, outliers, and incomplete samples. τ stands for some transformation in the image domain (e.g., a 2D affine transformation for correcting misalignment, or a 2D projective transformation for handling some perspective change in the camera model). Henceforth, L ∘ τ describes the lowest-rank estimate for samples drawn solely from the camera motion, whereas S describes all underlying subspaces and G contains the errors. From the sample set L ∘ τ + S we can obtain reliable trajectories for all subspaces. The assumption here is that each subspace has a spectral nature, i.e., each subspace forms a unique affinity matrix that can be used to reveal the true segmentation of the data. Also, L ∘ τ provides the underlying lowest-rank representation of the data, which reduces the problem to a simple clustering of independent motions in the scene, as the samples in S are drawn only from object-induced trajectories.
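As a concrete illustration of the regularized objective, the following minimal numpy sketch evaluates the ℓ2,1-norm and the Frobenius-plus-ℓ2,1 cost for a candidate decomposition D ≈ L + S, with the transformation τ taken as the identity for simplicity. The function names are ours, not part of the paper's implementation.

```python
import numpy as np

def l21_norm(M):
    """l2,1-norm: the l1-norm of the vector of column-wise l2-norms."""
    return np.sum(np.linalg.norm(M, axis=0))

def objective(D, L, S, lam):
    """Frobenius penalty on the residual G = D - L - S plus the
    l2,1 penalty on the sparse pattern S (tau taken as identity)."""
    G = D - L - S
    return 0.5 * np.linalg.norm(G, 'fro') ** 2 + lam * l21_norm(S)
```

Because the ℓ2,1-norm sums whole-column magnitudes, minimizing it drives entire columns of S to zero, which is what makes it select a group-sparse set of trajectories rather than scattered entries.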
Problem (2) is a difficult, non-convex optimization problem. Fortunately, we can find a good initialization by pre-aligning all frames in the sequence to the middle frame before the main minimization loop. The pre-alignment is done by the robust multi-resolution method proposed in [19], which succeeds in most cases provided that no drastic scene change occurs in the sequence. As described in [26], we can then solve (2) by repeatedly linearizing about the current estimate of τ and seeking a deformation step ∆τ. In other words, at each iteration we update τ by a small increment ∆τ and linearize D ∘ (τ + ∆τ) as D ∘ τ + J∆τ, where J = ∂(D ∘ τ)/∂τ denotes the Jacobian matrix. Thus, τ can be updated via the following minimization problem:

min_{L, S, G, ∆τ}  (1/2) ||G||_F^2 + λ ||S||_{2,1}   s.t.   D ∘ τ + J∆τ = L + S + G,   rank(L) ≤ r.   (3)

The minimization over ∆τ in (3) is a weighted least-squares problem with a closed-form solution. In practice, the update of τ can be done separately for each frame, since the transformation is applied to each image individually; the update of τ is therefore efficient. We then use an alternating minimization procedure to solve for L and S one at a time until convergence; that is, we solve two reduced problems, each minimized independently from the other. The residual error of the approximation of D by L ∘ τ + S is stored in G. The entries of G can be large in magnitude, but they are random and scattered, exhibiting the behavior of the error deviation described above. The discerning difference between S and G is that G shows no structure in the sparsity domain, whereas the support of S is determined by the ℓ2,1-norm minimizer.
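The two reduced problems of the alternating procedure have simple closed-form updates: the L-update is a truncated SVD and the S-update is a column-wise shrinkage (the proximal operator of the ℓ2,1-norm). The sketch below shows this inner loop with τ held fixed at the identity; `alternate_LS` is our illustrative name, not the paper's code.

```python
import numpy as np

def shrink_columns(M, tau):
    """Proximal step for tau * l2,1-norm: shrink each column's l2-norm by tau."""
    norms = np.maximum(np.linalg.norm(M, axis=0), 1e-12)
    return M * np.maximum(1.0 - tau / norms, 0.0)

def rank_r_approx(M, r):
    """Best rank-r approximation of M via truncated SVD (the L-update)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def alternate_LS(D, r, lam, iters=100):
    """Alternate the L- and S-updates for D ~ L + S + G, with the
    transformation tau held fixed (identity) for clarity."""
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(iters):
        L = rank_r_approx(D - S, r)     # reduced problem in L
        S = shrink_columns(D - L, lam)  # reduced problem in S
    return L, S
```

The residual D − L − S after convergence plays the role of G: it absorbs whatever the low-rank and group-sparse patterns cannot explain.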

Independent Subspace Motion Extraction
The trajectories obtained in S are induced by two motion components: rigid camera motion and object motion. When the motion of interest includes global object motion, it can be further decomposed into two components: rigid object motion and articulated motion. We employ recent advances in sparse optimization to estimate each of these components and to extract the object trajectories that correspond solely to the motion of interest. [27] and [20] assume that the majority of the observed motion is induced by the camera; this assumption does not fit most realistic data, so we refrain from making it in order not to lose generality. The trajectories drawn from the samples should generally span a subspace determined by the scene structure and the camera's intrinsic and extrinsic parameters. In order to find a basis for the trajectory subspace, we have obtained from ARPCAC a 2F × P matrix S (P samples) using the position vectors of the trajectories in a sequence. Through the following rank minimization surrogate, we can decompose S into two components, a low-rank matrix L and a sparse error matrix E:

min_{L, E}  ||L||_* + ξ ||E||_1   s.t.   S = L + E,   (7)

where ||·||_* is the nuclear norm, the sum of the singular values ||L||_* = Σ_i σ_i(L), and ||·||_1 is the ℓ1-norm. ξ trades off the rank of the solution against the sparsity of the error, and is always set to 1.1/√P following the theoretical considerations in [1]. Equation (7) can be solved with convex optimization methods such as the Augmented Lagrange Multiplier (ALM) algorithm [16]. The columns of the resulting low-rank matrix L define the basis of the low-rank components of the trajectories. The subspace spanned by the major basis vectors of L corresponds to the desired background subspace, which includes both the background trajectories and the camera motion component of the foreground (object) trajectories.
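A compact way to solve the nuclear-plus-ℓ1 program (7) is the inexact ALM scheme of [16], which alternates singular value thresholding for L with entrywise soft thresholding for E. The sketch below follows that standard recipe (parameter schedules are common defaults, not values from the paper), with ξ = 1.1/√P as stated above.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft thresholding: prox of tau * l1-norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def rpca_ialm(M, xi=None, iters=300, tol=1e-7):
    """Solve min ||L||_* + xi*||E||_1  s.t.  M = L + E by inexact ALM."""
    m, n = M.shape
    if xi is None:
        xi = 1.1 / np.sqrt(n)          # the paper's choice: xi = 1.1 / sqrt(P)
    norm_M = np.linalg.norm(M, 'fro')
    two_norm = np.linalg.norm(M, 2)
    Y = M / max(two_norm, np.abs(M).max() / xi)   # scaled dual variable
    mu, rho = 1.25 / two_norm, 1.5
    L = np.zeros_like(M)
    E = np.zeros_like(M)
    for _ in range(iters):
        L = svt(M - E + Y / mu, 1.0 / mu)
        E = soft(M - L + Y / mu, xi / mu)
        R = M - L - E
        Y = Y + mu * R
        mu = min(mu * rho, 1e9 / two_norm)
        if np.linalg.norm(R, 'fro') < tol * norm_M:
            break
    return L, E
```

On a trajectory matrix S built from mostly camera-dominated motion plus a few deviating tracks, `rpca_ialm(S)` returns the low-rank part L and the sparse deviations E used in the next step.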
On the other hand, any rigid body motion in the scene also contributes to L; therefore, the subspace spanned by the remaining basis vectors of L mostly corresponds to rigid body motions. Since the camera motion subspace is approximately spanned by three basis vectors [12,5], the camera motion component can be estimated as L_c = u s* v^T, where u and v are obtained by the singular value decomposition [u, s, v] = SVD(L), and s* retains only the three most significant singular values of s. The rigid body motion component is then given by L − L_c. Moreover, the columns of the matrix E correspond to the deviation of each trajectory from the recovered low-rank subspace, which captures the articulated motions [27]. Therefore, the total object trajectories E_t, which include both the articulated and the rigid body motions, are given by E_t = E + (L − L_c). Figure 1 shows the motion decomposition for two sequences from the Hopkins155 dataset. As can be seen, the trajectories obtained for the background and foreground are both contaminated by the camera motion. Note that the motion trajectory of the woman walking in the middle column is completely different from the actual motion trajectory revealed by ARPCAC in the right column. Clean motion trajectories are crucial for applications such as human motion analysis, and the trajectories in the middle column, which are usually what trajectory extractors produce, would adversely affect the results. In the next example we show that ARPCAC can cluster multiple independent motion subspaces. Figure 2 illustrates the motion decomposition for three examples from the Hopkins155 dataset. These examples make clear that the proposed independent object motion extractor succeeds in subtracting the camera motion from each motion subspace while simultaneously clustering each motion trajectory into its corresponding subspace.
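The estimates L_c = u s* v^T and E_t = E + (L − L_c) translate directly into a few lines of numpy; the function names below are ours.

```python
import numpy as np

def camera_component(L, k=3):
    """L_c = u s* v^T: keep only the k most significant singular values of L."""
    u, s, vt = np.linalg.svd(L, full_matrices=False)
    s_star = np.where(np.arange(len(s)) < k, s, 0.0)
    return (u * s_star) @ vt

def object_trajectories(L, E, k=3):
    """Total object motion E_t: articulated part E plus rigid part L - L_c."""
    return E + (L - camera_component(L, k))
```

When L itself has rank at most three, the camera component absorbs all of L and E_t reduces to the articulated deviations E alone.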

Segmentation of Multiple Rigid-Body Motions
We use a combination of ARPCAC and PowerFactorization that leads to the following geometric solution to the multiframe 3D motion segmentation problem [25]. First, we project the object-induced motion trajectories E extracted by ARPCAC onto a five-dimensional subspace using PowerFactorization. Then, we fit a collection of subspaces to the projected trajectories by fitting a homogeneous polynomial representing all motion subspaces to the projected data. Next, we obtain a basis for each motion subspace from the derivatives of this polynomial. Finally, we apply spectral clustering to a similarity matrix built from the subspace angles.
We have tested our approach on a database of 155 motion sequences containing full, independent, degenerate, and dependent motions, as well as missing data and outliers. Our algorithm achieves an error of 0.89% for two motions and 3.78% for three motions.

Projection using PowerFactorization
From here on, without loss of generality, we refer to the sample matrix as W, which may stand for either the object samples S or the object-induced samples E. We wish to replace W by a matrix obtained by projecting its columns onto a 5-dimensional subspace. If AB^T is the nearest rank-5 factorization of W, then Ŵ = B^T is the matrix we require. The measure of closeness of AB^T to W is

Σ_{(i,j)∈I} (W_ij − (AB^T)_ij)^2,

where I is the set of pairs (i, j) for which W_ij is known. With PowerFactorization we start with a random matrix A_0 and alternate the following steps until A_k B_k^T converges:

1. Given A_{k−1}, find the matrix B_k that minimizes Σ_{(i,j)∈I} (W_ij − (A_{k−1} B_k^T)_ij)^2.
2. Orthonormalize the columns of B_k by replacing it with a matrix B̄_k such that B_k = B̄_k N_k, where B̄_k has orthonormal columns and N_k is upper-triangular.
3. Given B_k, find the matrix A_k that minimizes Σ_{(i,j)∈I} (W_ij − (A_k B_k^T)_ij)^2.

Essentially, this algorithm alternates between computing A_k and B_k using least-squares.
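The three steps above can be sketched directly in numpy: per-column and per-row least squares handle the known-entry mask I, and the orthonormalization B_k = B̄_k N_k is exactly a QR factorization. This is an illustrative sketch under those assumptions, not the reference implementation.

```python
import numpy as np

def power_factorization(W, r=5, mask=None, iters=100, seed=0):
    """Rank-r factorization W ~ A B^T over the known entries given by mask."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    mask = np.ones((m, n), bool) if mask is None else mask
    A = rng.standard_normal((m, r))          # random A_0
    for _ in range(iters):
        # Step 1: solve for each row of B by least squares over known entries.
        B = np.zeros((n, r))
        for j in range(n):
            rows = mask[:, j]
            B[j], *_ = np.linalg.lstsq(A[rows], W[rows, j], rcond=None)
        # Step 2: B = B_bar N with B_bar orthonormal, N upper-triangular (QR).
        B, _ = np.linalg.qr(B)
        # Step 3: solve for each row of A given the orthonormalized B.
        A = np.zeros((m, r))
        for i in range(m):
            cols = mask[i]
            A[i], *_ = np.linalg.lstsq(B[cols], W[i, cols], rcond=None)
    return A, B  # the projected data is B^T (r x n)
```

For fully observed data of exact rank r, one sweep already recovers W exactly: after step 1 the rows of B^T span the row space of W, and step 3 then reproduces W = (W B) B^T.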

Fitting Polynomials to Projected Trajectories
We have reduced the motion segmentation problem to finding a set of linear subspaces in R^5, each of dimension at most 4, which contain the data points (or come close to them). The points in question, {w_p}_{p=1}^P, are the columns of the projected data matrix Ŵ = [w_1, . . . , w_P] ∈ R^{5×P}. We obtain a polynomial q representing the n motion subspaces by computing its vector of coefficients c ∈ R^{M_n} as the singular vector of the embedded data matrix W̃ = [w̃_1, . . . , w̃_P] ∈ R^{M_n×P} corresponding to its smallest singular value, where w̃_p is the degree-n Veronese embedding of w_p.
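This fitting step can be sketched as follows: embed each projected point via the degree-n Veronese map (all degree-n monomials of its entries), stack the embeddings, and take the right singular vector of the transposed embedded matrix associated with the smallest singular value. Function names are ours.

```python
import numpy as np
from itertools import combinations_with_replacement

def veronese(w, n):
    """Degree-n Veronese map: all monomials of degree n in the entries of w."""
    return np.array([np.prod(w[list(idx)])
                     for idx in combinations_with_replacement(range(len(w)), n)])

def fit_polynomial(W_hat, n):
    """Coefficient vector c of a degree-n polynomial vanishing on the columns
    of W_hat: the singular vector of the embedded data matrix associated with
    its smallest singular value."""
    V = np.column_stack([veronese(w, n) for w in W_hat.T])  # M_n x P
    _, _, vt = np.linalg.svd(V.T)   # rows of V.T are the embedded samples
    return vt[-1]                   # unit-norm c with c^T V ~ 0
```

For instance, points drawn from the union of the hyperplanes w_1 = 0 and w_2 = 0 are annihilated (up to scale) only by the degree-2 polynomial q(w) = w_1 w_2, and the fit concentrates c on that monomial.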

Feature Clustering via Polynomial Differentiation
The feature points can then be clustered by applying spectral clustering to the similarity matrix S_ij = cos^2(θ_ij), where θ_ij is the angle between the vectors ∇q(w_i) and ∇q(w_j) for i, j = 1, . . . , P, with the derivative of q defined as the 5-vector ∇q(w) = (∂q/∂w_1, . . . , ∂q/∂w_5)^T. The standard factorization approach is then applied to each of the n groups of features to obtain the motion and structure parameters.
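The similarity matrix and the clustering step can be sketched as below; for brevity the spectral step is shown only for the two-group case via the Fiedler vector of the normalized Laplacian (k-means on several eigenvectors generalizes this to n > 2). Function names are ours.

```python
import numpy as np

def gradient_similarity(G):
    """S_ij = cos^2(theta_ij) between gradient vectors (the columns of G)."""
    Gn = G / np.linalg.norm(G, axis=0, keepdims=True)
    return (Gn.T @ Gn) ** 2

def two_way_spectral(S):
    """Two-group spectral split from the sign pattern of the Fiedler vector
    of the normalized graph Laplacian."""
    d = S.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    Lap = np.eye(len(d)) - D_isqrt @ S @ D_isqrt   # normalized Laplacian
    _, V = np.linalg.eigh(Lap)                      # ascending eigenvalues
    fiedler = D_isqrt @ V[:, 1]
    return (fiedler > 0).astype(int)
```

Gradients evaluated at points of the same subspace are (near-)parallel, so cos^2 of their angle is close to 1, while gradients from different subspaces are close to orthogonal; the spectral split then recovers the membership.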

Hopkins155
To verify the segmentation performance of ARPCAC, we conduct experiments on the Hopkins155 [24] motion database, which provides an extensive benchmark for testing various subspace segmentation algorithms. Hopkins155 contains 155 video sequences along with features extracted and tracked in all frames. The segmentation performance on this dataset is shown in Table 1. These results illustrate that ARPCAC performs considerably better than other PCA-based counterparts, namely PCA, RPCA_1, RPCA_{2,1}, SR, LRR, and GPCA. Besides its superior segmentation accuracy, another advantage of ARPCAC is that it works well under a wide range of parameter settings: we chose the same λ value for all tests, whereas the other PCA-based methods except LRR are sensitive to the parameter λ. Compared to the state-of-the-art methods in the lower tier of Table 1, our method performs on par, placing third after SSC and SLBF. This performance could be improved by tuning λ per problem, but we refrain from doing so in order to demonstrate autonomous performance. Moreover, our algorithm is superior in clustering multiple motions in a scene, as shown in Table 3, whereas SSC and SLBF are both better suited for single motion segmentation. The running time of ARPCAC is comparable to PCA and surpasses that of the other PCA-based methods, as shown in Table 2, making it suitable for real-time use. The results of applying the subspace clustering algorithms to the original 2F-dimensional feature trajectories for the 2-motion and 3-motion categories of Hopkins155 are shown in Table 3. Our algorithm achieves top performance in all motion categories.

Yale-Caltech
To test ARPCAC's effectiveness in the presence of outliers and corruptions, we create a dataset by combining the Extended Yale Database B [14] and Caltech101 [8]; Figure 3-Left shows some examples from this dataset. Table 4 shows that ARPCAC outperforms the PCA and RPCA methods in terms of both subspace segmentation and outlier detection. To visualize ARPCAC's effectiveness in error correction, Figure 3-Right shows some produced results. It is worth noting that the "error" term E can contain "useful" information, e.g., eyes and salient parts, that can be used for emotion and visual cue recognition. The low-rank part L ∘ τ corresponds to the principal features of each subject that discriminate it from the rest of the data. The aligned and cleaned L ∘ τ part can be used for face recognition, as well as for face clustering as done in this paper.

LFW
The Labeled Faces in the Wild (LFW) database of public figures [13] exhibits significant variations in pose, facial expression, illumination, and occlusion; moreover, the ground-truth (i.e., undistorted, unrotated, unshifted) image is not known. In total there are 681 image samples taken from 20 subjects. ARPCAC aligns these images to an 80 × 60 canonical frame, and affine transformations τ are used to cope with the large variability in pose. Figure 4 shows one example of the results on this dataset. Our algorithm proves effective even in the presence of large misalignments and corruptions.

Conclusion
We have proposed a low-rank and sparse representation to identify subspace structures in corrupted data. Our goal is to segment the samples into their respective subspaces and correct the possible errors simultaneously, while revealing each subspace's independent motion. ARPCAC is a generalization of the recently established RPCA method [1], extending the recovery of corrupted data from a single subspace to multiple dynamic subspaces in which both the camera and the scene objects move. Both theoretical and experimental results show the effectiveness of ARPCAC in subspace segmentation and in clustering misaligned and corrupted faces. In future work, we would like to extend ARPCAC so that the separation of the extracted independent motions relies on the low-rank and sparse decomposition alone, without PowerFactorization and spectral clustering.