Class-specific nonlinear subspace learning based on optimized class representation

In this paper, a new nonlinear subspace learning technique for class-specific data representation based on an optimized class representation is described. An iterative optimization scheme is formulated in which both the optimal nonlinear data projection and the optimal class representation are determined at each optimization step. This approach is tested on human face and action recognition problems, where its performance is compared with that of the standard class-specific subspace learning approach, as well as with other nonlinear discriminant subspace learning techniques. Experimental results demonstrate the effectiveness of this new approach, since it consistently outperforms the standard one and outperforms other nonlinear discriminant subspace learning techniques in most cases.


INTRODUCTION
Standard Discriminant Learning techniques, like Linear Discriminant Analysis (LDA) [1,2], Kernel Discriminant Analysis (KDA) [3], (kernel) Spectral Regression (KSR) [4] and Class-specific (kernel) Discriminant Analysis (CSKDA) [5], represent classes by their class mean vectors. Thus, they implicitly assume that the classes forming the classification problem follow unimodal normal distributions having the same covariance structure [2]. These are two strong assumptions that are rarely met in real classification problems. It has been recently shown that, when these assumptions are not met, the adoption of optimized class representations, other than the class mean vectors, leads to a discriminant subspace of increased class discrimination power [6,7]. In this paper, we follow this line of work and describe an optimization scheme that determines such an optimized class representation for class-specific nonlinear data projection.
In detail, we describe a new class-specific discrimination criterion which is used to optimize both the data projection and the class representation for the determination of a low-dimensional feature space of increased discrimination power. This class-specific criterion is formulated so as to exploit data representations in arbitrary-dimensional Hilbert spaces for nonlinear data projection and classification [8][9][10][11]. An iterative optimization scheme is applied to this end, which optimizes the class-specific criterion with respect to both the data projection matrix and the class representation. For the calculation of the optimal data projection matrix, an optimization process based on the Spectral Regression framework [4] is adopted in order to obtain a fast optimization method, when compared to the standard approach [3,5]. We compare the performance of the resulting Class-specific Reference Discriminant Analysis (CSRDA) algorithm with that of other Discriminant Analysis-based classification schemes, i.e., KDA, KSR and CSKDA, as well as with the performance of the Kernel Support Vector Machine (KSVM) classifier, which is one of the standard choices in nonlinear classification problems. Experiments are conducted on six publicly available datasets, namely the ORL [12], AR [13] and Extended YALE-B [14] datasets for face recognition and the Hollywood2 [15], Olympic Sports [16] and ASLAN [17] datasets for human action recognition.
The rest of the paper is organized as follows. In Section 2, an overview of related work is provided. The CSRDA method is described in Section 3. Experimental results evaluating its performance are provided in Section 4. Finally, conclusions are drawn in Section 5.

RELATED WORK
Let us denote by $x_i \in \mathbb{R}^D$, $i = 1, \dots, N$, a set of $N$ vectors, each belonging to a class in the class set $\mathcal{C} = \{1, \dots, C\}$. Let us also denote by $c_j \in \mathbb{R}^N$, $j = 1, \dots, C$, $C$ binary vectors having elements $c_{ji} = 1$ when $x_i$ belongs to class $j$ and $c_{ji} = 0$ otherwise. We use $N_{j0}$ and $N_{j1}$ to denote the number of zeros and ones in $c_j$, respectively. By using $x_i$, $i = 1, \dots, N$, and $c_j$, $j = 1, \dots, C$, a feature space of reduced dimensionality $d < D$ can be determined by learning a nonlinear data projection of the vectors $x_i$ to vectors $z_i \in \mathbb{R}^d$.
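As a small illustration of this notation, the indicator vectors and the per-class counts can be built from integer labels as sketched below. The function name `class_indicators` and the 0-based label encoding are assumptions made for the example only.

```python
import numpy as np

def class_indicators(labels, num_classes):
    """Return a (C, N) binary matrix whose row j is the indicator vector c_j."""
    labels = np.asarray(labels)
    # c[j, i] = 1 if sample i belongs to class j, 0 otherwise
    c = (labels[None, :] == np.arange(num_classes)[:, None]).astype(int)
    n_ones = c.sum(axis=1)           # N_j1: number of samples in class j
    n_zeros = c.shape[1] - n_ones    # N_j0: number of samples in the other classes
    return c, n_zeros, n_ones

labels = [0, 2, 1, 0, 2, 2]          # hypothetical labels, classes indexed from 0
c, n0, n1 = class_indicators(labels, num_classes=3)
```

Each row of `c` plays the role of one vector c_j, so a class-specific problem for class j only needs `c[j]`.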
In order to exploit kernel techniques for nonlinear data projection, the input space $\mathbb{R}^D$ is mapped to an arbitrary-dimensional feature space $\mathcal{F}$ (usually having the properties of Hilbert spaces [8][9][10][11][18]) by employing a function $\phi(\cdot): x_i \in \mathbb{R}^D \rightarrow \phi(x_i) \in \mathcal{F}$ that determines a nonlinear mapping from the input space to $\mathcal{F}$. In this space, we would like to determine a data projection matrix $W$ that will be used to map a given sample $x_i$ to a low-dimensional feature space $\mathbb{R}^{d_j}$ of increased discrimination power:

$$z_i = W^T \phi(x_i). \qquad (1)$$

In practice, since the multiplication in (1) cannot be directly computed, the so-called kernel trick [8,9] is adopted. That is, the multiplication in (1) is inherently computed by using dot-products in $\mathcal{F}$.

Standard nonlinear Discriminant Learning techniques, like KDA [3] and KSR [4], solve an optimization problem involving relations between the within-class and between-class scatters of the training data in $\mathcal{F}$. That is, they employ the class mean vectors

$$\phi(m_j) = \frac{1}{N_{j1}} \sum_{i:\, c_{ji} = 1} \phi(x_i)$$

in order to calculate the within-class and between-class scatter matrices

$$S_w = \sum_{j=1}^{C} \sum_{i:\, c_{ji} = 1} \big(\phi(x_i) - \phi(m_j)\big)\big(\phi(x_i) - \phi(m_j)\big)^T, \qquad S_b = \sum_{j=1}^{C} N_{j1} \big(\phi(m_j) - \phi(m)\big)\big(\phi(m_j) - \phi(m)\big)^T,$$

where $\phi(m) = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)$ is the mean of the entire training set in $\mathcal{F}$, and calculate the data projection matrix $W$ by solving an optimization problem that is a function of $S_w$, $S_b$, e.g., the trace ratio optimization problem [1,19]:

$$W^* = \arg\max_{W} \frac{\mathrm{tr}(W^T S_b W)}{\mathrm{tr}(W^T S_w W)}.$$

While the multi-class discriminant learning approach described above is able to determine a reduced-dimensionality feature space of increased class discrimination, it has been shown that class-specific discriminant learning methods are able to outperform multi-class ones in several tasks, like facial image classification [5]. In this case, the objective is the determination of a reduced-dimensionality feature space $\mathbb{R}^{d_j}$, $d_j < D$, where class $j$ is better discriminated from all others. This is achieved by optimizing the trace ratio criterion using the following scatter matrices:

$$S_w = \sum_{i:\, c_{ji} = 1} \big(\phi(x_i) - \phi(m_j)\big)\big(\phi(x_i) - \phi(m_j)\big)^T, \qquad S_b = \sum_{i:\, c_{ji} = 0} \big(\phi(x_i) - \phi(m_j)\big)\big(\phi(x_i) - \phi(m_j)\big)^T,$$

where the class mean vector $\phi(m_j)$ is employed for the representation of class $j$ in $\mathcal{F}$.
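To make the kernel trick concrete, the sketch below computes the squared distance between a mapped sample and a class mean vector in the feature space using kernel values only. It is an illustration under assumptions: the helper names, the Gaussian form of the RBF kernel and its width `sigma` are choices made for the example. The linear kernel, for which the mapping is the identity, is used as a sanity check against the explicit computation.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix; sigma is a free width parameter (assumed)."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def sq_dist_to_class_mean(K_xx_diag, K_xc, K_cc):
    """||phi(x) - phi(m_j)||^2 for each x, using kernel values only.

    K_xx_diag: k(x, x) for each query sample, shape (M,)
    K_xc:      k(x, x_i) for the class-j samples x_i, shape (M, N_j)
    K_cc:      k(x_i, x_l) among the class-j samples, shape (N_j, N_j)
    """
    return K_xx_diag - 2.0 * K_xc.mean(axis=1) + K_cc.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))     # hypothetical query samples
Xj = rng.normal(size=(4, 3))    # hypothetical samples of class j

# Sanity check with the linear kernel, where phi is the identity map
d_kernel = sq_dist_to_class_mean((X * X).sum(1), X @ Xj.T, Xj @ Xj.T)
d_explicit = ((X - Xj.mean(axis=0)) ** 2).sum(axis=1)

# The same function works unchanged with an RBF kernel, where phi is implicit
d_rbf = sq_dist_to_class_mean(np.ones(len(X)), rbf_kernel(X, Xj), rbf_kernel(Xj, Xj))
```

The scatter-based criteria are sums of such squared distances after projection, which is why they can be evaluated without ever forming the mapped vectors explicitly.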
It has been recently shown that, for the multi-class subspace learning problem, the adoption of optimized class representations increases class discrimination in the reduced-dimensionality feature space, leading to enhanced performance [6,7]. In the following Section, we describe a class-specific optimization scheme that can be employed for the determination of both optimized class representation and data projection.

CLASS-SPECIFIC PROJECTIONS BASED ON OPTIMIZED REPRESENTATION
Let us denote by $\phi(\mu_j) \in \mathcal{F}$ a so-called reference vector that will be used to represent class $j$. $\phi(\mu_j)$ is not restricted to be the class mean vector in $\mathcal{F}$; it can be any vector that enhances the discrimination of class $j$ from the remaining ones in the discriminant space $\mathbb{R}^{d_j}$. As previously described, we would like to learn a data projection matrix $W$ which maps $\mathcal{F}$ to a low-dimensional discriminant space $\mathbb{R}^{d_j}$ where the samples belonging to class $j$ are as close as possible to the image of $\phi(\mu_j)$ in $\mathbb{R}^{d_j}$, i.e., $z_j = W^T \phi(\mu_j)$, while the samples belonging to the remaining classes are as far as possible from it. That is, we would like to learn a projection matrix $W \in \mathbb{R}^{|\mathcal{F}| \times d_j}$ minimizing

$$D_j = \sum_{i:\, c_{ji} = 1} \big\| W^T \phi(x_i) - W^T \phi(\mu_j) \big\|_2^2$$

and maximizing

$$D_0 = \sum_{i:\, c_{ji} = 0} \big\| W^T \phi(x_i) - W^T \phi(\mu_j) \big\|_2^2.$$

$W$ can be determined by solving for

$$W^* = \arg\max_{W} J(W, \mu_j), \qquad J(W, \mu_j) = \frac{D_0}{D_j} = \frac{\mathrm{tr}(W^T S_0 W)}{\mathrm{tr}(W^T S_j W)}, \qquad (10)$$

where $S_j$, $S_0$ are defined by

$$S_j = \sum_{i:\, c_{ji} = 1} \big(\phi(x_i) - \phi(\mu_j)\big)\big(\phi(x_i) - \phi(\mu_j)\big)^T, \qquad S_0 = \sum_{i:\, c_{ji} = 0} \big(\phi(x_i) - \phi(\mu_j)\big)\big(\phi(x_i) - \phi(\mu_j)\big)^T.$$

$S_j$ and $S_0$ express the intra-class and out-of-class variances of the training samples with respect to $\phi(\mu_j)$, respectively. Since they are matrices of arbitrary dimensions, the direct maximization of (10) is intractable. In the following subsection, we describe an optimization process based on kernel Spectral Regression [4] that can be used to maximize (10) for the determination of the optimal data projection $W$. Subsequently, we describe an optimization process that determines the optimal class representation $\phi(\mu_j)$ (given $W$) and the iterative process that optimizes $J$ with respect to both $W$ and $\phi(\mu_j)$. Finally, we describe a classification process that can be employed in combination with this method.

Spectral Regression-based optimization of (10)
In order to directly optimize $J$ in (10), we express $W$ as a linear combination of the training data (represented in $\mathcal{F}$) [8,9,18], i.e.,

$$W = \Phi A,$$

where $A \in \mathbb{R}^{N \times d_j}$ is a matrix containing the reconstruction weights of $W$ with respect to the training data in $\mathcal{F}$, and $\Phi = [\phi(x_1), \dots, \phi(x_N)]$ is a matrix containing the data representations in $\mathcal{F}$. Without loss of generality, we assume that the data are ordered so that $\Phi = [\Phi_j, \Phi_0]$, where $\Phi_j$ is a matrix containing the training data belonging to class $j$ and $\Phi_0$ is a matrix containing the remaining samples.

Let us denote by $v$ an eigenvector of the problem $S_0 v = \lambda S_j v$ with eigenvalue $\lambda$. $v$ can be expressed as a linear combination of the training data in $\mathcal{F}$, i.e., $v = \sum_{i=1}^{N} \alpha_i \phi(x_i) = \Phi a$ with $a = [\alpha_1, \dots, \alpha_N]^T$. By setting $K a = q$, where $K = \Phi^T \Phi$ is the kernel matrix, this eigenanalysis problem can be transformed to the following equivalent problem:

$$P_0 q = \lambda P_j q. \qquad (14)$$

Thus, the reconstruction weights matrix $A$ can be computed by applying a two-step procedure:
• Solution of the eigenproblem $P_0 q = \lambda P_j q$, which is tractable since $P_0, P_j \in \mathbb{R}^{N \times N}$. The solution of this problem leads to the determination of a matrix $Q = [q_1, \dots, q_{d_j}]$, where $q_i$ is the eigenvector corresponding to the $i$-th largest eigenvalue.
• Determination of the matrix $A = [a_1, \dots, a_{d_j}]$, where $K a_i = q_i$. In the case where $K$ is non-singular, the vectors $a_i$ are given by $a_i = K^{-1} q_i$. When this is not true, the vectors $a_i$ can be obtained by solving the following set of linear equations: $(K + \delta I) a_i = q_i$, where $\delta > 0$ is a regularization parameter. Thus, $a_i$ is given by $a_i = (K + \delta I)^{-1} q_i$.
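The two-step procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the matrices P_j and P_0 are taken to be sums of outer products of (e_i − β) over the in-class and out-of-class samples, where β holds the expansion weights of the reference vector over the training data (one form consistent with the scatter matrices being expressed through Φ), and a small `ridge` term is added to P_j because it is rank deficient. The function name and all parameter values are hypothetical.

```python
import numpy as np

def sr_projection(K, in_class, beta, d, delta=1e-3, ridge=1e-3):
    """Two-step sketch: solve P0 q = lambda Pj q, then (K + delta I) a = q.

    K:        (N, N) kernel matrix
    in_class: (N,) boolean mask, True for the samples of class j
    beta:     (N,) expansion weights of the reference vector (assumed form)
    d:        number of projection directions to keep
    """
    N = K.shape[0]
    I = np.eye(N)
    E = I - beta[:, None]                       # column i holds e_i - beta
    Pj = E[:, in_class] @ E[:, in_class].T      # assumed in-class matrix
    P0 = E[:, ~in_class] @ E[:, ~in_class].T    # assumed out-of-class matrix

    # Pj is rank deficient, so a small ridge (an assumption) makes it invertible.
    L = np.linalg.cholesky(Pj + ridge * I)
    # Whitening with L turns the generalized problem into a symmetric standard one.
    M = np.linalg.solve(L, np.linalg.solve(L, P0).T)
    M = 0.5 * (M + M.T)
    lam, U = np.linalg.eigh(M)
    top = np.argsort(lam)[::-1][:d]             # d largest eigenvalues
    Q = np.linalg.solve(L.T, U[:, top])         # map back: q = L^{-T} u

    # Step 2: regularized linear systems (K + delta I) a_i = q_i
    A = np.linalg.solve(K + delta * I, Q)
    return A, Q, lam[top]

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))                    # hypothetical training data
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                           # RBF kernel, unit width (assumed)
in_class = np.arange(12) < 5                    # first five samples form class j
beta = in_class / in_class.sum()                # reference vector = class mean
A, Q, lam = sr_projection(K, in_class, beta, d=2)
```

A new sample with kernel vector k (its kernel values against the training data) would then be projected as `A.T @ k`. Both steps involve N × N matrices, in line with the cubic cost discussed in the text.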
As can be seen, the above-described optimization process requires the solution of one eigenanalysis problem (14) and the inversion of an $N \times N$ matrix, leading to a time complexity of $O(N^3)$.

Reference Class Vector calculation
By observing that $S_j$, $S_0$ are functions of $\phi(\mu_j)$, as detailed in (10), and by using $\phi(\mu_j) = \Phi_j b_j$ [8,9,18], $\phi(\mu_j)$ can be implicitly determined by maximizing $J$ with respect to $b_j$, i.e., $b_j^* = \arg\max_{b_j} J(W, b_j)$. Solving $\nabla_{b_j} J(W, b_j) = 0$ yields the update rule for $b_j$.
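For illustration, one way to carry out this step is sketched below. It is a derivation under assumptions, not necessarily the exact expression used in the original work: the symbols $k_i$ (the $i$-th column of the kernel matrix $K = \Phi^T \Phi$), $K_j = \Phi^T \Phi_j$, $G$, $s_1$, $s_0$, $D_1$ and $D_0$ are introduced here, with $W = \Phi A$ held fixed and $D_1 > 0$.

```latex
\begin{aligned}
z_i &= A^T k_i, \qquad W^T \phi(\mu_j) = G\, b_j, \qquad G = A^T K_j, \\
D_1(b_j) &= \sum_{i:\, c_{ji}=1} \| z_i - G b_j \|_2^2, \qquad
D_0(b_j) = \sum_{i:\, c_{ji}=0} \| z_i - G b_j \|_2^2, \qquad
J = \frac{D_0}{D_1}, \\
\nabla_{b_j} D_1 &= -2\, G^T \left( s_1 - N_{j1}\, G b_j \right), \qquad
s_1 = \sum_{i:\, c_{ji}=1} z_i, \\
\nabla_{b_j} D_0 &= -2\, G^T \left( s_0 - N_{j0}\, G b_j \right), \qquad
s_0 = \sum_{i:\, c_{ji}=0} z_i, \\
\nabla_{b_j} J = 0
&\;\Longleftrightarrow\; \nabla_{b_j} D_0 = J\, \nabla_{b_j} D_1
\;\Longleftrightarrow\; \left( J N_{j1} - N_{j0} \right) G^T G\, b_j = G^T \left( J s_1 - s_0 \right).
\end{aligned}
```

Because $J$ on the right-hand side itself depends on $b_j$, this condition is naturally used as a fixed-point update in which $J$ is evaluated at the current estimate. Since $G^T G$ has rank at most $d_j$, a pseudo-inverse or a small regularization term would be needed to solve for $b_j$.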

Optimization with respect to both A and b j
Taking into account that $A$ is a function of $b_j$ and that $b_j$ is a function of $A$, a direct maximization of $J$ with respect to both $A$ and $b_j$ is difficult. In order to maximize $J$ with respect to both $A$ and $b_j$, we employ an iterative optimization scheme, where $A$ and $b_j$ are iteratively updated until $(J(t+1) - J(t)) / J(t) < \epsilon$, where $\epsilon$ is a small positive value (equal to $\epsilon = 10^{-6}$ in our experiments).
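The alternating scheme and its stopping rule can be sketched generically as below. The callables `update_A`, `update_b` and `objective` are hypothetical placeholders for the two update steps and the criterion J, and the toy objective exists only to make the sketch runnable. The relative-improvement test assumes J stays positive.

```python
def alternate(update_A, update_b, objective, b_init, eps=1e-6, max_iter=100):
    """Alternate the two updates until the relative gain in J drops below eps."""
    b = b_init
    A = update_A(b)
    J_prev = objective(A, b)
    J = J_prev
    for _ in range(max_iter):
        b = update_b(A)          # class representation given the projection
        A = update_A(b)          # projection given the class representation
        J = objective(A, b)
        if (J - J_prev) / J_prev < eps:
            break
        J_prev = J
    return A, b, J

# Toy stand-in for J: each update maximizes it exactly in one variable.
def objective(a, b):
    return 100.0 - (a - b) ** 2 - (b - 1.0) ** 2

A, b, J = alternate(update_A=lambda b: b,
                    update_b=lambda a: (a + 1.0) / 2.0,
                    objective=objective,
                    b_init=5.0)
```

Since each step maximizes J in one block of variables while the other is fixed, J is non-decreasing across iterations in this toy setting, which is what makes a relative-improvement threshold a sensible stopping rule.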

Classification (test phase)
Classification is performed as follows. After the determination of the discriminant space $\mathbb{R}^{d_j}$, both the training data $x_i$, $i = 1, \dots, N$, and the reference class vector $\phi(\mu_j)$ are mapped to that space, and $z_i$, $i = 1, \dots, N$, and $z_j$ are obtained. Subsequently, we calculate distance vectors $d_i \in \mathbb{R}^{d_j}$ having elements equal to

$$d_{ik} = |z_{ik} - z_{jk}|,$$

where $z_{ik}$, $z_{jk}$ are the $k$-th elements of $z_i$ and $z_j$, respectively, and $|\cdot|$ denotes the absolute value operator. By using $d_i$, classification can be performed based on a linear classifier, e.g., a linear SVM. In the case of multi-class classification, we train $C$ linear SVM classifiers in a one-versus-rest manner using the above-described process. A test sample is introduced to all $C$ classifiers and is assigned to the class providing the maximal probability, similar to [20,21].
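The distance-vector construction and the final decision rule can be sketched as follows. The arrays are hypothetical values chosen for illustration, and the probabilities are assumed to come from C already-trained one-versus-rest classifiers, whose training is not shown.

```python
import numpy as np

def distance_features(Z, z_ref):
    """d_i = |z_i - z_ref| elementwise; Z is (N, d_j), z_ref is (d_j,)."""
    return np.abs(Z - z_ref[None, :])

def assign_class(prob):
    """prob[i, j]: probability of sample i under the j-th one-versus-rest classifier."""
    return np.argmax(prob, axis=1)

Z = np.array([[0.1, -0.2], [1.5, 2.0], [-0.3, 0.4]])   # hypothetical projected samples
z_ref = np.array([0.0, 0.1])                           # hypothetical projected reference vector
D = distance_features(Z, z_ref)

prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])    # hypothetical classifier outputs
pred = assign_class(prob)
```

Note that each class j has its own discriminant space and reference vector, so a test sample yields a different distance vector for each of the C classifiers.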

EXPERIMENTS
In this section, we present experiments conducted in order to compare the performance of the two class-specific discriminant learning approaches. We have employed six publicly available datasets to this end: the ORL [12], AR [13] and Extended YALE-B [14] datasets (face recognition) and the Hollywood2 [15], Olympic Sports [16] and ASLAN [17] datasets (human action recognition). In all our experiments we compare the performance of Class-Specific Reference Discriminant Analysis (CSRDA) with that of Class-Specific Kernel Discriminant Analysis (CSKDA) [5], as well as with Kernel Spectral Regression (KSR) [4], Kernel Discriminant Analysis (KDA) [3] and kernel Support Vector Machine (SVM)-based classification [22]. In all the experiments involving facial image classification we have employed the RBF kernel function. In human action recognition, we used the state-of-the-art methods proposed in [17,23] as baseline approaches. On the ASLAN dataset, we employ a set of 12 histogram similarity values expressing the similarity of pairs of videos represented by using the BoW model for HOG, HOF and HNF descriptors evaluated on STIP video locations [24], combined with a linear classification scheme. For the remaining datasets, we employ the BoW-based video representation using HOG, HOF, MBHx, MBHy and (normalized) Trajectory descriptors evaluated on the trajectories of densely sampled interest points [23], and classification is performed by a nonlinear classification scheme using the RBF-$\chi^2$ kernel function.

Results
We have applied the competing algorithms on the face recognition datasets. Since there is no widely adopted experimental protocol for these datasets, we randomly partition each dataset into training and test sets as follows: we randomly select a subset of the facial images depicting each person in order to form the training set, and we keep the remaining facial images for evaluation. Experimental results obtained by applying the competing algorithms are illustrated in Table 1. Class-specific classification schemes outperformed the multi-class ones in all but one case. By optimizing both the data projection matrix and the class representation, CSRDA enhances class discrimination when compared to CSKDA, leading to enhanced classification performance. Table 2 illustrates the performance obtained by applying the competing classification schemes on the action recognition datasets. It can be seen that CSRDA provides satisfactory performance in all cases.
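The random per-person partition used for the face recognition datasets can be sketched as below. The function name, the `train_fraction` value and the seed are assumptions made for the example; the text does not state the proportion of images used for training.

```python
import numpy as np

def split_per_person(person_ids, train_fraction, seed=0):
    """Randomly pick a share of each person's images for training, rest for testing."""
    rng = np.random.default_rng(seed)
    person_ids = np.asarray(person_ids)
    train_mask = np.zeros(len(person_ids), dtype=bool)
    for p in np.unique(person_ids):
        idx = np.flatnonzero(person_ids == p)            # images of person p
        n_train = max(1, int(round(train_fraction * len(idx))))
        train_mask[rng.choice(idx, size=n_train, replace=False)] = True
    return np.flatnonzero(train_mask), np.flatnonzero(~train_mask)

ids = np.repeat(np.arange(4), 10)    # hypothetical: 4 persons, 10 images each
train_idx, test_idx = split_per_person(ids, train_fraction=0.5)
```

Repeating the split with different seeds and averaging the resulting accuracies is a common way to reduce the variance introduced by a single random partition.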

CONCLUSIONS
In this paper, we described a new nonlinear subspace learning technique for class-specific data representation based on an optimized class representation. An iterative optimization scheme was formulated and evaluated to this end, where both the optimal nonlinear data projection and the optimal class representation are determined at each optimization step. Experimental results on six publicly available datasets demonstrate the effectiveness of this class-specific approach, since it consistently outperforms the standard class-specific one and outperforms other nonlinear discriminant subspace learning techniques in most cases.