Incremental Accelerated Kernel Discriminant Analysis

In this paper a novel incremental dimensionality reduction (DR) technique called incremental accelerated kernel discriminant analysis (IAKDA) is proposed. At each incremental step, a nonlinear DR transformation is efficiently computed from the eigenvalue decomposition of a relatively small matrix and the recursive block Cholesky factorization of the kernel matrix. Moreover, by employing factorization techniques of excellent numerical stability, IAKDA effectively removes data nonlinearities in the low-dimensional subspace. Experimental evaluation on various multimedia tasks and datasets confirms that the proposed approach combined with linear support vector machines (LSVMs) offers improved mean average precision (MAP) and provides an impressive training time speedup over batch KDA, as well as over traditional LSVM and kernel SVM (KSVM).


INTRODUCTION
Kernel discriminant analysis (KDA) is one of the most popular dimensionality reduction techniques with important applications, among others, in multimedia analysis, computer vision and visualization [1,2,4,13,16,22,23,31,42]. This method learns a nonlinear transformation by first applying a nonlinear mapping from the input space to a new feature space and then solving a generalized eigenproblem (GEP) defined by the between- and within-class scatter matrices. Despite its success in many challenging problems, its difficulty to scale well with the number of training observations has restricted the widespread use of this technology in today's Big Data problems. This limitation is much more intense in incremental learning problems [37,53], where new training observations arrive sequentially, one at a time or in larger groups called chunks, as shown in Figure 1. For instance, in video event detection applications [8,14,47-49], where usually the videos of the event are scarce and the event detectors are initially trained using only a few of them, using KDA to learn a new transformation matrix each time a new positive video example becomes available would have severe implications for the time-efficiency and usefulness of the overall approach.
During the past few years, quite a large variety of incremental linear discriminant analysis (LDA) algorithms have been developed. For instance, in [37], the between- and within-class scatter matrices are updated and their generalized eigenvalue decomposition is computed at every incremental step. A sequential DR method based on a rank-one QR updating algorithm [19] and the adaptive computation of the desired scatter matrices is presented in [55]. An extension of the above approach so that it can deal with chunks of incoming data is provided in [39]. In [56], the projection and total scatter matrices are updated using a variant of the generalized singular value decomposition. Assuming centered data, in [28,54], the indicator, class-means and projection matrices are updated at each incremental step using a least-squares-based algorithm. In [52], the above method is extended for real-time streaming applications. In [9], an adaptive LDA method based on the incremental estimation of the square root of the inverse covariance matrix is proposed. Due to the slow convergence of this approach, in [33] a quasi-Newton algorithm is applied to accelerate the estimation of the above matrix. In [15,32], an explicit cost function and the Newton-Raphson or Gradient Descent algorithms are employed to provide a further speedup of the method presented in [9]. The sufficient set spanning approximation is used in [24] for updating the scatter and linear transformation matrices each time new observations are fetched. This approach is among the most efficient and accurate incremental variants of LDA.
The major drawback of the above methods is that they cannot be successfully applied to nonlinear problems. The sequential variant of spectral regression KDA (SRKDA) presented in [6] is one of the few nonlinear incremental variants of LDA in the literature. This method computes a discriminant subspace using Gram-Schmidt orthogonalization and spectral regression. However, it requires centering of the data in the feature space, which may significantly increase its computational cost and round-off errors. Moreover, due to the centering operation, the assumption of SRKDA and its incremental variant that the kernel matrix is nonsingular is always violated, breaking down the theoretical development of the method [31,38].
In [17,18], accelerated KDA (AKDA) is presented in order to overcome the deficiencies of SRKDA and achieve state-of-the-art training time performance. Specifically, AKDA utilizes a novel matrix factorization and a simultaneous diagonalization framework to efficiently derive the DR transformation matrix without requiring normalized data. However, despite its efficiency this method is not appropriate for incremental learning problems. To this end, incremental AKDA (IAKDA) is proposed here, which exploits the framework introduced in [17,18] to initialize the DR transformation matrix, and the recursive block Cholesky factorization [19] to update this matrix incrementally (source code is made publicly available at https://www.iti.gr/~bmezaris/downloads.html). The proposed method offers several advantages over the incremental approaches described above; most importantly, it can effectively deal with data nonlinearities and does not require any data normalization. The experimental evaluation on various image and video classification datasets [2,45,46] shows that the proposed approach achieves improved accuracy and an impressive speedup over linear and kernel DA, and their incremental variants. Moreover, as shown by the experimental results, the combination of IAKDA with pre-trained convolutional neural networks (CNNs) for feature representation provides excellent classification performance and several orders of magnitude faster training times in comparison to retraining CNNs at each incremental step. In summary, the main contributions of this paper are:
• A novel incremental nonlinear DR method is proposed, which, in contrast to previous approaches, deals effectively with nonlinear learning problems and does not require zero-mean data normalization.
• As demonstrated by the experimental evaluation, the proposed method achieves state-of-the-art training time performance and improved detection accuracy in various multimedia tasks.
The rest of the paper is structured as follows: The problem formulation is given in Section 2. Section 3 describes the application of traditional batch KDA to the incremental learning problem, while the proposed IAKDA is presented in Section 4. In Section 5, evaluation results on various datasets are provided. Finally, conclusions are drawn in Section 6.

PROBLEM FORMULATION
In incremental learning, at each time $t$ the overall dataset $X_t$ consists of a previously existing dataset $\check{X}_t = [\check{x}_1, \ldots, \check{x}_{\check{N}_t}]$ and a new dataset $\tilde{X}_t = [\tilde{x}_1, \ldots, \tilde{x}_{\tilde{N}_t}]$ containing the observations fetched at time $t$, where $\check{x}_\kappa, \tilde{x}_\kappa \in \mathbb{R}^L$, $L$ is the dimensionality of the input space $\mathbb{R}^L$, and $\check{N}_t, \tilde{N}_t$ are the numbers of observations in $\check{X}_t$ and $\tilde{X}_t$, respectively. Thus, the total number of observations at time $t$ is $N_t = \check{N}_t + \tilde{N}_t$, and the entire dataset can be represented as
$$X_t = [\check{X}_t, \tilde{X}_t] = [x_1, \ldots, x_{N_t}]. \qquad (1)$$
We assume that the dataset is annotated, that is, each observation $x_\kappa$ is paired with a label $y_\kappa \in \{1, \ldots, C_t\}$, where $C_t$ is the number of classes at time $t$. The target in incremental DR is to compute the DR transformation matrix at time $t$ by exploiting the quantities already computed at time $t-1$ and the new dataset $\tilde{X}_t$. This contrasts with batch DR learning, where the entire set $X_t$ is used as a whole, ignoring the time information.
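To make the streaming setting concrete, the following minimal sketch (with hypothetical names, given only as an illustration of the chunked arrival protocol) shows how a training set could be fed to an incremental DR learner in chunks, where each step supplies only the newly fetched observations and their labels.

```python
import numpy as np

def stream_in_chunks(X, y, n_chunks):
    """Yield (X_new, y_new) chunks of an L x N data matrix X (columns are observations)
    and its N labels y, simulating observations arriving over time."""
    for idx in np.array_split(np.arange(X.shape[1]), n_chunks):
        yield X[:, idx], y[idx]

# A learner would call initialize(X_1, y_1) on the first chunk and then
# update(X_t, y_t) on every later chunk; both method names are hypothetical.
```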

BATCH KDA
Conventional KDA [4,31,42], called hereafter batch KDA, is not designed for incremental learning problems. That is, for learning the DR transformation matrix at time $t$ it uses the entire dataset $X_t$ (1), ignoring any quantities computed at time $t-1$. To deal with nonlinearly-separable classes, batch KDA utilizes a nonlinear mapping $\phi(\cdot)$ from the input space to the so-called feature space $\mathbb{R}^F$ of dimensionality $F$, associated with the Mercer kernel $k(\cdot,\cdot)$, so that the inner products in the feature space can be replaced with kernel function evaluations [21,30]. Given $X_t$, it then seeks the linear transformation $\Gamma_t \in \mathbb{R}^{F \times D_t}$, $D_t \ll F$, that maximizes the following criterion
$$\Gamma_t = \arg\max_{\Gamma} \operatorname{tr}\!\big((\Gamma^T \Sigma_{w,t} \Gamma)^{\dagger} (\Gamma^T \Sigma_{b,t} \Gamma)\big), \qquad (2)$$
$$\Sigma_{b,t} = \sum_{i=1}^{C_t} N_{i,t} (\mu_{i,t} - \mu_t)(\mu_{i,t} - \mu_t)^T, \qquad (3)$$
$$\Sigma_{w,t} = \sum_{i=1}^{C_t} \sum_{\kappa: y_\kappa = i} (\phi_\kappa - \mu_{i,t})(\phi_\kappa - \mu_{i,t})^T, \qquad (4)$$
where $\operatorname{tr}(A)$, $A^T$ and $A^{\dagger}$ denote the trace, transpose and pseudoinverse of matrix $A$, respectively. In (2), $\Sigma_{b,t}$ and $\Sigma_{w,t}$ are the between- and within-class scatter matrices at time $t$, and $\phi_\kappa$, $N_{i,t}$, $\mu_{i,t}$, $\mu_t$ are the $\kappa$th observation in $\mathbb{R}^F$, the number of observations and estimated sample mean of class $i$, and the sample mean of the entire dataset, respectively. The optimization problem in (2) is equivalent to finding the nonzero eigenpairs (NZEP) of the following generalized eigenproblem (GEP)
$$\Sigma_{b,t} \Gamma_t = \Sigma_{w,t} \Gamma_t \Lambda_t, \qquad (5)$$
where $\Gamma_t$'s columns are the eigenvectors of the matrix pencil $(\Sigma_{b,t}, \Sigma_{w,t})$, $\Lambda_t \in \mathbb{R}^{D_t \times D_t}$ is the diagonal eigenvalue matrix whose diagonal elements are sorted in descending order, and $D_t = \operatorname{rank}(\Sigma_{b,t})$.
To avoid working directly in $\mathbb{R}^F$, which may be of very high or even infinite dimensionality, the transformation matrix is expressed as $\Gamma_t = \Phi_t \Psi_t$, where $\Phi_t = [\phi_1, \ldots, \phi_{N_t}]$ are the mapped training data and $\Psi_t \in \mathbb{R}^{N_t \times D_t}$ is the expansion coefficient matrix. The following GEP is then considered,
$$S_{b,t} \Psi_t = S_{w,t} \Psi_t \Lambda_t, \qquad (6)$$
which under mild conditions [34] is equivalent to the one in (5). That is, we now need to identify the NZEP of the matrix pencil $(S_{b,t}, S_{w,t})$, where $S_{b,t}, S_{w,t} \in \mathbb{R}^{N_t \times N_t}$, defined in (7) and (8), are the kernel matrices associated with the between- and within-class variability, $k_\kappa = \Phi_t^T \phi_\kappa = [k(x_1, x_\kappa), \ldots, k(x_{N_t}, x_\kappa)]^T$ is the kernel vector associated with the $\kappa$th training observation, and $K_{t,t} = \Phi_t^T \Phi_t$ is the kernel matrix of the training observations. We should note that $\Sigma_{w,t}$ and $S_{w,t}$ are often replaced by their total scatter matrix counterparts $\Sigma_{m,t}$ (10) and $S_{m,t}$ (11), respectively. Typically, the GEP in (6) is solved using techniques that compute the simultaneous diagonalization of $S_{b,t}$ and $S_{w,t}$ (or $S_{m,t}$) [19,38]. Given the identified linear transformation $\Psi_t$, the projection $z \in \mathbb{R}^{D_t}$ of a test vector $x \in \mathbb{R}^L$ can then be computed as $z = \Psi_t^T k_x$, where $k_x = [k(x_1, x), \ldots, k(x_{N_t}, x)]^T$. We observe that using the batch KDA approach described above, the various scatter matrices in (7), (8) (or (11)) and the GEP in (6) need to be computed at each point in time $t$ that a new chunk of observations is fetched. Taking into account that the computational cost of KDA at each time period $t$ is approximately $40 N_t^3$ [17,18], we can easily conclude that the use of this method in incremental learning applications is impractical.
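To illustrate why the batch approach is costly, the sketch below forms kernel scatter matrices and solves the GEP (6) from scratch for the full dataset. This is a generic, textbook-style KDA formulation given only for illustration (the exact expressions (7), (8) of this paper are those of [17,18]); the ridge term and variable names are our own additions.

```python
import numpy as np
from scipy.linalg import eigh

def batch_kda(K, y, reg=1e-6):
    """Generic batch KDA on a precomputed N x N kernel matrix K with labels y.

    Builds kernel between-/within-class scatter matrices and solves the GEP
    S_b Psi = S_w Psi Lambda; every call costs O(N^3) and, in the batch setting,
    is repeated from scratch at every time step.
    """
    N = K.shape[0]
    classes = np.unique(y)
    m_all = K.mean(axis=1, keepdims=True)               # global kernel mean vector
    S_b = np.zeros((N, N))
    S_w = np.zeros((N, N))
    for c in classes:
        idx = np.where(y == c)[0]
        Kc = K[:, idx]
        m_c = Kc.mean(axis=1, keepdims=True)            # class kernel mean vector
        S_b += len(idx) * (m_c - m_all) @ (m_c - m_all).T
        S_w += (Kc - m_c) @ (Kc - m_c).T
    # Generalized symmetric eigenproblem; a small ridge keeps S_w positive definite.
    evals, evecs = eigh(S_b, S_w + reg * np.eye(N))
    order = np.argsort(evals)[::-1][: len(classes) - 1]  # keep the D = C-1 leading eigenpairs
    return evecs[:, order], evals[order]
```

A test vector would then be projected as `z = Psi.T @ k_x`, with `k_x` the kernel vector between the test observation and the training data.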

PROPOSED IAKDA
Given $X_t$ (1) at time $t$, IAKDA computes the coefficient matrix $V_t$ satisfying the simultaneous diagonalization of the kernel scatter matrices expressed in (12), (13) and (14) [17,18], where $W_t = \Phi_t V_t$ is the corresponding feature-space transformation. That is, the computed $W_t$ satisfies the KDA optimization criterion presented in (2) [17,18,31,38]. IAKDA consists of an initialization and an incremental updating part, as described in the following subsections.
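As a reminder of what such a simultaneous diagonalization means in general, a transformation $W$ simultaneously diagonalizes a pair of scatter matrices when, for instance,
$$W^T \Sigma_{b,t} W = \Lambda_t \ (\text{diagonal}), \qquad W^T \Sigma_{w,t} W = I_{D_t};$$
the precise matrices and normalizations used by IAKDA are those given in (12)-(14) of [17,18], and are not necessarily this exact form.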

Initialization
The initialization step of IAKDA is based on AKDA [17,18]. Let $X_t$ (1), where $t = 1$, be the initial dataset fetched by the learning system. Following AKDA, the respective scatter matrices $\Sigma_{b,t}$ (3), $\Sigma_{w,t}$ (4) and $\Sigma_{m,t}$ (10) can be factorized compactly (cf. (15) and [17,18]), where $J_{N_t}, I_{N_t} \in \mathbb{R}^{N_t \times N_t}$ are the all-one and identity matrices. Above, $n_t$, $N_t$ are the so-called strength vector and strength matrix, $[R_t]_{n,i}$ is the element in the $n$th row and $i$th column of the class indicator matrix $R_t \in \mathbb{R}^{N_t \times C_t}$, and $N_t^{1/2} = \operatorname{diag}(\sqrt{N_{1,t}}, \ldots, \sqrt{N_{C_t,t}})$. Moreover, in (15), $O_t$ is the so-called core matrix, defined in (19) in terms of $n_t$, where $n_t^{T/2}$ denotes the transpose of $n_t^{1/2}$. The columns of $\Xi_t \in \mathbb{R}^{C_t \times (C_t - 1)}$ are the eigenvectors of $O_t$ corresponding to its nonzero eigenvalues, and the matrix $\Theta_t$ is obtained from $R_t$, $N_t$ and $\Xi_t$ using (20). As shown in [17,18], $\Theta_t$ provides the simultaneous diagonalization of $C_{b,t}$, $C_{w,t}$ and $C_{m,t}$. The expansion coefficient matrix $V_t$ can then be computed by solving the following linear system
$$K_{t,t} V_t = \Theta_t. \qquad (21)$$
It can be verified that the respective transformation $W_t$ provides the desired simultaneous diagonalization presented in (12), (13), (14).
For more details on the above mathematical derivations the interested reader is referred to [17,18]. The kernel matrix $K_{t,t}$ is always symmetric positive definite or symmetric positive semidefinite. To this end, its Cholesky factorization is used to efficiently solve the linear system in (21). The overall procedure for the initialization of IAKDA is summarized in Algorithm 1. We should note that, based on AKDA, the initialization step of IAKDA is very efficient, offering an approximately $O(N_t^3/3)$ computational complexity dominated by the Cholesky factorization of $K_{t,t}$ [17-19]. More importantly, as we see in the next subsection, the above framework facilitates the design of an incremental updating step, offering a further speedup over batch KDA as well as AKDA.
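As a concrete illustration of this initialization, the minimal sketch below solves the linear system (21) via the Cholesky factor of the kernel matrix, assuming the target matrix $\Theta_t$ has already been formed as in (20). The Gaussian kernel, the jitter safeguard and all variable names are our own illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gaussian_kernel(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between column-observation arrays X (L x N) and Y (L x M)."""
    d2 = (np.sum(X**2, axis=0)[:, None] + np.sum(Y**2, axis=0)[None, :]
          - 2.0 * X.T @ Y)
    return np.exp(-gamma * d2)

def iakda_init(X, Theta, gamma, jitter=1e-10):
    """Initialization step: solve K V = Theta (eq. 21) using the Cholesky factor of K.

    X: L x N training data (columns are observations); Theta: N x (C-1) target matrix,
    assumed precomputed as in eq. (20). Returns the expansion coefficients V and the
    lower-triangular factor L, which is reused by the incremental update.
    """
    K = gaussian_kernel(X, X, gamma)
    K[np.diag_indices_from(K)] += jitter          # guard against a semidefinite K
    L = cholesky(K, lower=True)                   # K = L L^T
    Y = solve_triangular(L, Theta, lower=True)    # forward substitution: L Y = Theta
    V = solve_triangular(L.T, Y, lower=False)     # back substitution:    L^T V = Y
    return V, L
```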

Incremental updating
As defined in (1), let $\check{X}_t$ and $\tilde{X}_t$ be the existing and new datasets at time $t$, respectively. Thus, the overall dataset at time $t$ is formed as $X_t = [\check{X}_t, \tilde{X}_t]$. In AKDA, the overall dataset $X_t$ is used to obtain $V_t$ as shown in the initialization step above and described in more detail in [17,18]. In contrast, IAKDA computes $V_t$ very efficiently using the quantities precomputed at $t-1$ and the new dataset $\tilde{X}_t$, as explained in the following. The indicator matrix and the strength vector and matrix can be efficiently updated as shown in (22), (23) and (24), where $\check{R}_t, \check{n}_t, \check{N}_t$ and $\tilde{R}_t, \tilde{n}_t, \tilde{N}_t$ are the indicator matrix, strength vector and strength matrix associated with $\check{X}_t$ and $\tilde{X}_t$, respectively. In case not all existing classes $\check{C}_t$ are represented in $\tilde{X}_t$, or when new classes are introduced in $\tilde{X}_t$, the corresponding arrays in (22), (23) and (24) are padded with zeros accordingly, in order to allow the execution of the respective matrix operations. Utilizing the updated $n_t$, the core matrix $O_t$ can be computed using (19), and subsequently its eigenvector matrix $\Xi_t$ can be easily retrieved. Similarly, utilizing the updated $R_t$, $N_t$ and $\Xi_t$, the eigenvector matrix $\Theta_t$ is obtained using (20). The kernel matrix $K_{t,t} = \Phi_t^T \Phi_t$ can be represented as
$$K_{t,t} = \begin{bmatrix} \check{K}_{t,t} & \bar{K}_{t,t}^T \\ \bar{K}_{t,t} & \tilde{K}_{t,t} \end{bmatrix}, \qquad (25)$$
where $\Phi_t = [\check{\Phi}_t, \tilde{\Phi}_t]$, $\check{\Phi}_t$, $\tilde{\Phi}_t$ are the feature space representations of $\check{X}_t$ and $\tilde{X}_t$, respectively, $\check{K}_{t,t} = \check{\Phi}_t^T \check{\Phi}_t$ is the existing kernel matrix, already computed at time $t-1$, $\bar{K}_{t,t} = \tilde{\Phi}_t^T \check{\Phi}_t$, and $\tilde{K}_{t,t} = \tilde{\Phi}_t^T \tilde{\Phi}_t$. Based on the above formulation, the Cholesky factor update is computed using a block Cholesky framework [19], as explained in the following. Suppose that the lower triangular Cholesky factor $L_t$ of $K_{t,t}$ is expressed in block format as
$$L_t = \begin{bmatrix} L_{1,t} & \check{0}_t \\ L_{2,t} & L_{3,t} \end{bmatrix}, \qquad (26)$$
where the blocks $L_{1,t} \in \mathbb{R}^{\check{N}_t \times \check{N}_t}$ and $L_{3,t} \in \mathbb{R}^{\tilde{N}_t \times \tilde{N}_t}$ are lower triangular matrices to be identified, $L_{2,t} \in \mathbb{R}^{\tilde{N}_t \times \check{N}_t}$, and $\check{0}_t$ is the $\check{N}_t \times \tilde{N}_t$ all-zero matrix. Using (26), $K_{t,t}$ can be expressed as
$$K_{t,t} = L_t L_t^T = \begin{bmatrix} L_{1,t} L_{1,t}^T & L_{1,t} L_{2,t}^T \\ L_{2,t} L_{1,t}^T & L_{2,t} L_{2,t}^T + L_{3,t} L_{3,t}^T \end{bmatrix}. \qquad (27)$$
Equating blocks between (25) and (27) we get
$$\check{K}_{t,t} = L_{1,t} L_{1,t}^T, \qquad (28)$$
$$\bar{K}_{t,t} = L_{2,t} L_{1,t}^T, \qquad (29)$$
$$\tilde{K}_{t,t} = L_{2,t} L_{2,t}^T + L_{3,t} L_{3,t}^T. \qquad (30)$$
From (28) we observe that $L_{1,t}$ is identical to the Cholesky factor $\check{L}_t$ of $\check{K}_{t,t}$, which has already been computed in the previous time period, i.e., $L_{1,t} = \check{L}_t$. The next block of the Cholesky factor can then be easily retrieved by solving the triangular system in (29), yielding $L_{2,t} = \bar{K}_{t,t} L_{1,t}^{-T}$, where $A^{-T}$ denotes the inverse of $A^T$ for any matrix $A$. Finally, by reformulating (30) as $\tilde{K}_{t,t} - \bar{K}_{t,t} \check{K}_{t,t}^{-1} \bar{K}_{t,t}^T = L_{3,t} L_{3,t}^T$, we observe that $L_{3,t}$ is the Cholesky factor of $\tilde{K}_{t,t} - \bar{K}_{t,t} \check{K}_{t,t}^{-1} \bar{K}_{t,t}^T$. Using the computed $\Theta_t$ and $L_t$, the linear transformation $V_t$ is then retrieved by solving the two triangular linear systems $L_t Y_t = \Theta_t$ and $L_t^T V_t = Y_t$. The incremental updating part of IAKDA is depicted in Algorithm 2. In summary, we see that $R_t$ (22), $n_t$ (23), $N_t$ (24), $K_{t,t}$ (25) and $L_t$ (26) are computed by fully exploiting their respective counterparts from the previous time period, thus achieving a significant speedup over the application of KDA or AKDA to the incremental learning problem. We also observe that, overall, IAKDA has very good numerical properties due to the fact that it consists of a small number of elementary matrix operations and numerically stable factorizations.
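The block Cholesky update above maps directly onto a few dense linear-algebra calls. The sketch below is a minimal illustration of (25)-(30) followed by the two triangular solves for $V_t$, assuming the previously computed factor $\check{L}_t$ and the updated $\Theta_t$ are available; the jitter safeguard and variable names are our own additions, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def iakda_update(L_prev, K_cross, K_new, Theta, jitter=1e-10):
    """Incremental Cholesky update of eqs. (25)-(30), then the triangular solves for V_t.

    L_prev : Cholesky factor of the existing kernel matrix (N_prev x N_prev, lower triangular)
    K_cross: kernel matrix between new and existing observations (N_new x N_prev)
    K_new  : kernel matrix of the new observations (N_new x N_new)
    Theta  : updated target matrix of eq. (20), shape (N_prev + N_new) x (C - 1)
    """
    L1 = L_prev                                            # eq. (28): reuse the existing factor
    L2 = solve_triangular(L1, K_cross.T, lower=True).T     # eq. (29): L2 = K_cross L1^{-T}
    S = K_new - L2 @ L2.T                                  # Schur complement from eq. (30)
    S[np.diag_indices_from(S)] += jitter                   # guard against semidefiniteness
    L3 = cholesky(S, lower=True)
    # Assemble the full lower-triangular factor of eq. (26).
    n_prev, n_new = L1.shape[0], L3.shape[0]
    L = np.zeros((n_prev + n_new, n_prev + n_new))
    L[:n_prev, :n_prev] = L1
    L[n_prev:, :n_prev] = L2
    L[n_prev:, n_prev:] = L3
    # Retrieve V_t from the two triangular systems L Y = Theta and L^T V = Y.
    Y = solve_triangular(L, Theta, lower=True)
    V = solve_triangular(L.T, Y, lower=False)
    return V, L
```

Only the cross and new-data kernel blocks need to be evaluated at each step, which is the source of the speedup over recomputing and refactorizing the full kernel matrix.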

EXPERIMENTAL EVALUATION
The proposed methods are evaluated using various datasets for the tasks of object, concept and event detection in images and videos, as described in the next subsections.

Datasets
For the experimental evaluation the following 7 datasets are used: a) AwA [25]: provides images matching the 50 animal categories in Osherson's animal/attribute matrix; b) Ayahoo [12]: consists of various object and concept images collected using the Yahoo! image search engine; c) Caltech101 [27]: contains images belonging to a wide variety of categories, such as airplane, chair and crocodile; d) Eth80 [26]: consists of 8 object classes, recorded from 41 different views; e) Imagenet [10]: a large-scale concept detection dataset; f) Med10 [20,36]: contains videos belonging to one of 3 real-world events or to the "rest-of-the-world" event; g) Office [43]: provides images of various resolutions from Amazon.com and from an office environment captured using a webcam and a digital SLR camera. Three images from each of the latter 6 datasets are depicted in Figure 2 (representative images from the AwA dataset are not shown due to copyright reasons). In the Med10 dataset, videos are represented in the input space $\mathbb{R}^{101376}$ using the improved dense trajectory features [2,51]. For the rest of the datasets, the image descriptors corresponding to the 4096 neurons of DeCAF's 6th layer [11], provided in [45,46], are used.

Experimental setup
For the evaluation in the task of event detection, the Med10 dataset is already divided into 1744 training and 1742 testing videos. For the rest of the datasets, we randomly selected 100 observations from each class to create a training set, and used the rest of the observations to form the testing set. Moreover, classes with fewer than 200 observations were divided equally between the training and testing sets. The resulting partitions for each dataset are shown in Table 1. Furthermore, each training set is divided into 10 equal partitions, where each partition contains the same number of observations per class. This is done in order to simulate a streaming application, where at each period of time a new chunk of observations is received. During the evaluation, IAKDA is compared with the incremental LDA (ILDA) presented in [24], and with batch versions of LDA, KDA and the state-of-the-art AKDA [17,18]. For IAKDA our Matlab implementation is used. For ILDA the code provided in [24] is utilized, for LDA and KDA the efficient Matlab functions provided in [5,6] are exploited, while for AKDA the respective Matlab implementation of [17,18] is used. For their comparison in the various tasks, the DR methods are further combined with a linear support vector machine (LSVM) [50], which learns the desired target classes in the discriminant subspace. The above methods are also compared with batch versions of LSVM and KSVM applied directly in the input space. For both LSVM and KSVM the library provided in [7] is exploited. The Gaussian kernel is utilized for IAKDA, AKDA, KDA and KSVM. At time period $t$, the batch LDA, KDA, AKDA, LSVM and KSVM are trained using the overall set $X_t$, consisting of the existing set $\check{X}_t$ and the new set of observations $\tilde{X}_t$, as described for batch KDA in Section 3. The different approaches are optimized using 3-fold cross-validation (CV), where at each fold the training set is randomly split into a 30% learning set and a 70% validation set. During the CV procedure, the kernel parameter $\gamma$ of the Gaussian kernel is selected from the grid $\{\ldots, 7\}$, while for the SVM penalty term the parameter space $\{0.1, 1, 10, 100\}$ is used. The performance of the $m$th method on each dataset and time period is measured using the MAP measure, $\bar{\rho}_m = \frac{1}{C}\sum_{i=1}^{C} \varrho_{m,i}$, and the training time speedup over KDA, $\bar{\theta}_m = \theta_{KDA}/\theta_m$, where $\varrho_{m,i}$ is the average precision [41] of the $m$th method for the $i$th class, and $\theta_m$ is the training time of method $m$ over the $C$ classes in the dataset. In our evaluation, for the Med10 dataset all $C = 3$ event classes are used. For the rest of the datasets, only the first $C = 10$ classes from each one are considered, while the observations belonging to the rest of the classes are regarded as instances of the rest-of-the-world class. Moreover, for the training time we consider only the time for building each classifier with the parameters fixed to the optimal ones derived using the CV procedure, i.e., the CV time is excluded. The evaluation was performed using Intel i7 3770K @ 3.5 GHz CPU, 32 GB RAM workstations with 64-bit Windows 7.
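As a small illustration of the evaluation measures above, the sketch below computes MAP from per-class average precision scores and the training-time speedup over KDA. It only restates the formulas in code; scikit-learn's average_precision_score stands in for the AP measure of [41], and the function name is ours.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate_method(scores_per_class, labels_per_class, train_time, train_time_kda):
    """MAP and training-time speedup over KDA, as defined in the text.

    scores_per_class: list of per-class score arrays for the test set
    labels_per_class: list of binary ground-truth arrays (1 = class member)
    """
    aps = [average_precision_score(y, s)          # rho_{m,i}: average precision for class i
           for y, s in zip(labels_per_class, scores_per_class)]
    map_score = float(np.mean(aps))               # MAP: mean AP over the C evaluated classes
    speedup = train_time_kda / train_time         # theta_KDA / theta_m
    return map_score, speedup
```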

Results
The evaluation results in terms of MAP and training time speedup over KDA for each method, along all time periods, are shown in Figures 3, 5 and 6. The MAP and training time speedup for the 1st, 5th and 10th time period (i.e., for the first tenth, half and the whole training dataset) are recorded in Tables 2 and 3, respectively. Finally, the speedup diagrams depicting the speedup achieved by IAKDA over AKDA along all time periods in the different datasets are provided in Figure 4. From the obtained results we conclude the following: i) In the majority of the datasets and time periods, IAKDA and AKDA provide the best performance in terms of MAP. For example, in the two larger datasets, AwA and Imagenet, at the 10th time period, IAKDA attains a MAP gain over the third best approach of more than 3%. This is due to the very good numerical properties of both methods, which, in comparison to the other DR methods, consist of few elementary matrix operations and very stable decomposition algorithms. ii) In AwA, Ayahoo, Imagenet, and to a smaller extent in Caltech101, we observe that the kernel DR methods (KDA, AKDA and IAKDA), as well as the LSVM and KSVM applied directly in the input space, achieve a large MAP gain over the linear DR approaches (LDA, ILDA). This shows that the classification tasks in the above datasets are nonlinear, and, thus, the linear DR approaches fail to discover an adequate linear discriminant subspace. The same is true for the Office and Eth80 datasets, with the difference that now KDA underperforms as well. We observe that in this case the number of observations is relatively low, and therefore the small sample size problem is quite intense. In consequence, the KDA scatter matrices are severely ill-conditioned, which explains the poor performance of this approach in these two datasets. Finally, in Med10 we observe that all methods have quite similar MAP performance. We thus conclude that the use of improved dense trajectories has effectively removed the nonlinearities in this dataset.
iii) We observe that IAKDA and AKDA attain the best and second best training time performance in the majority of the datasets.
Specifically, only LDA in the Imagenet dataset provides better performance after the 7th incremental step, which is expected due to the fact that for large-scale datasets (i.e., when $N \gg L$) such as this one, LDA has an $O(L^3)$ complexity. iv) An impressive training time speedup of IAKDA over batch KDA is attained in all datasets. For instance, IAKDA is approximately 195, 95 and 69 times faster than KDA in the Imagenet, AwA and Caltech101 datasets, respectively. As shown in the various figures, this speedup is the combination of the speedup offered by the base method (AKDA) and the additional speedup achieved by the proposed incremental variant (IAKDA). v) As shown in Figure 4, a linear speedup of IAKDA over its base method, AKDA, is attained. For instance, in the majority of the datasets, at the 10th incremental step IAKDA is already 2 times faster than AKDA. Considering the impressive training time efficiency of the base AKDA method [17,18], this is a considerable further improvement. Moreover, we observe that this improvement is achieved without sacrificing the detection performance on any of the datasets. vi) In comparison to deep learning, the proposed approach, i.e., the combination of IAKDA with pre-trained CNNs, is several orders of magnitude faster and can be applied more efficiently to incremental learning problems. For instance, in our internal evaluations the computation of the initialization step for learning the TRECVID SIN 2013 concepts [3] required a few minutes, versus several hours for fine-tuning the pre-trained GoogLeNet [44]. Moreover, preliminary results have shown that using the weights of pre-trained CNNs together with KDA-class algorithms for feature representation and LSVMs for classification outperforms the use of CNNs in the tasks of event and concept detection [1,29].
Finally, we should note that the most computationally intensive parts of IAKDA can be parallelized using appropriate computing architectures, as shown for instance in [1,2,19], providing even greater training time efficiency.

CONCLUSIONS
In this paper, IAKDA was proposed, which exploits the simultaneous diagonalization framework presented in [17,18] and a variant of the recursive block Cholesky factorization to devise an incremental learning algorithm for dimensionality reduction. The experimental evaluation on various image/video datasets demonstrated the effectiveness of IAKDA in terms of both learning time and classification accuracy. Future work includes the investigation of techniques for the efficient simultaneous optimization of IAKDA and CNNs [35], and the use of appropriate fine-tuning approaches, similarly to [40], to further improve performance.

ACKNOWLEDGMENTS
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-732665 EMMA.