AKSDA-MSVM: A GPU-accelerated Multiclass Learning Framework for Multimedia

In this paper, a combined nonlinear dimensionality reduction and multiclass classification framework is proposed. Specifically, a novel discriminant analysis (DA) technique, called accelerated kernel subclass discriminant analysis (AKSDA), derives a discriminant subspace, and a linear multiclass support vector machine (MSVM) computes a set of separating hyperplanes in the derived subspace. Moreover, within this framework, an approach for accelerating the computation of multiple Gram matrices and an associated late fusion scheme are presented. Experimental evaluation on five multimedia datasets, covering tasks such as video event detection and news document classification, shows that the proposed framework achieves excellent results in terms of both training time and generalization performance.


INTRODUCTION
In high-dimensional machine learning problems such as multimedia classification, the curse of dimensionality becomes a major challenge: the complexity of the pattern recognition model grows exponentially with the dimensionality of the space [8,25]. A key observation for addressing the curse of dimensionality is that most physical processes observed in some high-dimensional ambient space actually lie on an underlying low-dimensional manifold. Hence, the curse of dimensionality can be alleviated by first identifying and then operating in this "intrinsic" low-dimensional space. Based on this observation, a major research direction is the investigation of frameworks that combine manifold learning with classification approaches [3,29,12,15,22,19].
Kernel discriminant analysis (KDA) is a powerful class of manifold learning techniques (e.g. KFD [20], GDA [2]) that, combined with linear (or piecewise linear) classifiers, have provided very good results in very challenging tasks. Moreover, further performance improvements have been achieved
by subclass extensions of conventional KDA approaches, which, by imposing a less strict requirement on the feature mapping, are able to identify a more discriminant subspace (e.g. KSDA [28], KMSDA [13]). In the resulting subspace it is then possible to use lower-capacity classifiers, such as linear SVMs (LSVMs), improving the generalization performance of the overall system [25].
Despite the above advances, the application of subclass-based KDA approaches to large-scale problems remains computationally challenging. The core of this limitation lies in the computation of the Gram matrix and the solution of the generalized eigenvalue problem. Recently, accelerated generalised subclass discriminant analysis (AGSDA) and its GPU implementation have been proposed, alleviating the above drawbacks and achieving state-of-the-art results in a variety of problems [1]. For example, its combination with LSVM (AGSDA-LSVM) has resulted in increased precision and orders-of-magnitude faster training times over LSVM and KSVM [1] on the problems of concept and event detection in video. Furthermore, earlier approaches such as SRKDA [4] and GSDA [10] (upon which AGSDA was based) had already been shown to outperform many other KDA approaches.
The AGSDA-LSVM described above resolves many problems of subclass-based KDA approaches, but still suffers from the following limitations: i) it has been developed only for two-class problems and zero-mean datasets; ii) the GPU acceleration of AGSDA-LSVM has been proposed only in conjunction with the Gaussian RBF kernel. To address these limitations, in this paper we propose AKSDA-MSVM, which extends AGSDA-LSVM to multiclass classification and to datasets that are not necessarily normalized to have zero mean. Moreover, by carefully reformulating the Gram matrices of three popular kernels, we show how they can be computed simultaneously, exploiting both GPU and multi-core CPU acceleration. The experimental evaluation of AKSDA-MSVM on several large-scale datasets, for both multimedia classification and retrieval tasks, shows that the proposed approach achieves excellent performance in terms of both training time and generalization accuracy. Summarizing, the main contributions of this paper are:
• To the best of our knowledge, this is the first work that combines nonlinear discriminant analysis with multiclass support vector machines for multimedia classification.
• A method for the parallel computation of multiple Gram matrices is proposed, exploiting GPU and multi-core CPU acceleration.
• The proposed approach achieves very good performance in terms of both training time and generalization accuracy. Moreover, its software implementation is made freely available to the scientific community (software: http://mklab.iti.gr/project/aksda).
The rest of the paper is structured as follows: In Section 2, the proposed AKSDA-MSVM approach is described. Experimental results are presented in Section 3, while Section 4 concludes the paper.

AKSDA
Let X = {(x_n, (ω_n, υ_n)), n = 1, ..., N} be a subclass partition of an annotated training set, where x_n ∈ R^L is the n-th training observation in the L-dimensional input space, ω_n ∈ {1, ..., Ω} and υ_n ∈ {1, ..., Υ_{ω_n}} are its class and subclass labels, Υ_{ω_n} is the number of subclasses of class ω_n, and Ω, N are the total numbers of classes and observations, respectively. The partitioning of classes into subclasses can be performed fully automatically (in our experiments this is done using k-means). Moreover, the training set is ordered in ascending order of the class and subclass labels (i.e., ω_n ≤ ω_{n+1}, and within a given class υ_n ≤ υ_{n+1}). Given X, AKSDA solves the generalized eigenvalue problem

    K_c A K_c Ψ_c = K_c B K_c Ψ_c Λ_c,    (1)

where K_c, [K_c]_{r,q} = k_c(x_r, x_q), is the Gram matrix of the training set associated with a Mercer kernel function k_c(·,·): R^L × R^L → R; A is the between-subclass factor matrix, whose element [A]_{r,q} corresponding to the samples x_r, x_q is defined as

    [A]_{r,q} = P(ω_r, υ_r)(1 − P(ω_r)) / N_{ω_r,υ_r}²,              if (ω_r, υ_r) = (ω_q, υ_q),
    [A]_{r,q} = −P(ω_r, υ_r) P(ω_q, υ_q) / (N_{ω_r,υ_r} N_{ω_q,υ_q}),  if ω_r ≠ ω_q,
    [A]_{r,q} = 0,                                                    otherwise,

where N_{ω_r}, P(ω_r) and N_{ω_r,υ_r}, P(ω_r, υ_r) are the numbers of observations and the estimated prior probabilities of class ω_r and subclass (ω_r, υ_r), respectively; B = I − (1/N) J is the total factor matrix, where I and J are the N × N identity and all-ones matrices; Ψ_c ∈ R^{N×D} is the column-orthogonal eigenvector matrix; Λ_c ∈ R^{D×D} is the diagonal matrix with the sorted eigenvalues on its diagonal; D = Υ − 1 is the dimensionality of the projection subspace; and Υ is the total number of subclasses. (K_c may be a positive semidefinite matrix, in which case D ≤ Υ − 1; however, we assume that K_c is positive definite, which can easily be accomplished through regularization [11].) Following [11], the above problem can be solved in two steps: a) identifying the nonzero eigenpairs (V, Δ), V ∈ R^{N×D}, Δ ∈ R^{D×D}, of A, and b) obtaining Ψ_c by solving the linear matrix system

    K_c Ψ_c = V.    (2)

The eigenvector matrix V in the first step of AKSDA can be efficiently computed following [11]. The projection z of an observation x into the discriminant subspace can then be computed as

    z = Ψ_c^T [k_c(x_1, x), ..., k_c(x_N, x)]^T.    (3)
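To make the two-step solution concrete, the following is a minimal NumPy sketch of the AKSDA transform, assuming the reconstructed equations (1)-(3) and frequency-based prior estimates; the function names are illustrative and do not correspond to the released AKSDA software (which is GPU-accelerated).

```python
import numpy as np

def between_subclass_factor(cls, sub):
    """Between-subclass factor matrix A (see the definition above).
    cls, sub: length-N integer arrays of class / subclass labels.
    Priors are estimated as relative frequencies (illustrative choice)."""
    N = len(cls)
    labels = list(zip(cls, sub))
    n_sub = {l: labels.count(l) for l in set(labels)}   # N_{omega,upsilon}
    p_sub = {l: n / N for l, n in n_sub.items()}        # P(omega,upsilon)
    p_cls = {}                                          # P(omega)
    for (w, v), p in p_sub.items():
        p_cls[w] = p_cls.get(w, 0.0) + p
    A = np.zeros((N, N))
    for r in range(N):
        for q in range(N):
            lr, lq = labels[r], labels[q]
            if lr == lq:                                # same subclass
                A[r, q] = p_sub[lr] * (1.0 - p_cls[lr[0]]) / n_sub[lr] ** 2
            elif lr[0] != lq[0]:                        # different classes
                A[r, q] = -p_sub[lr] * p_sub[lq] / (n_sub[lr] * n_sub[lq])
    return A

def aksda_fit(K, cls, sub, reg=1e-6):
    """Two-step AKSDA solve: (a) eigenpairs (V, Delta) of A,
    (b) the linear system K Psi = V, with K regularized to be PD."""
    N = K.shape[0]
    A = between_subclass_factor(cls, sub)
    evals, evecs = np.linalg.eigh(A)                    # ascending eigenvalues
    D = len(set(zip(cls, sub))) - 1                     # D = Upsilon - 1
    V = evecs[:, np.argsort(evals)[::-1][:D]]           # top-D eigenvectors
    Psi = np.linalg.solve(K + reg * np.eye(N), V)       # step (b)
    return Psi                                          # N x D

# Projecting the training set via Eq. (3): since K is symmetric,
# row n of (K @ Psi) equals Psi^T [k(x_1, x_n), ..., k(x_N, x_n)]^T.
```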

GPU-accelerated computation of multiple Gram matrices
One of the most computationally expensive parts of AKSDA (and of most kernel-based approaches), in terms of both memory consumption and learning time, is the calculation of the Gram matrix K_c. In [1], a tiled general matrix multiplication (GEMM) approach was proposed for the efficient computation of the Gram matrix with the Gaussian RBF kernel. Inspired by [1], we propose a method for accelerating the computation of multiple Gram matrices. For illustration purposes, we examine the application of the proposed method to the computation of the Gram matrices associated with the Gaussian RBF, t-Student and Cauchy kernels,

    k_RBF(x_r, x_q) = exp(−γ ‖x_r − x_q‖²),
    k_tSt(x_r, x_q) = (1 + ‖x_r − x_q‖^d)^{−1},
    k_Cau(x_r, x_q) = (1 + ‖x_r − x_q‖²/σ²)^{−1},

where exp() is the exponential function, and γ, d, σ are the respective kernel parameters. Rearranging the above we get

    k_RBF(x_r, x_q) = exp(−γ u_{rq}),
    k_tSt(x_r, x_q) = sgm_{d/2}(u_{rq}),
    k_Cau(x_r, x_q) = sgm_1(u_{rq}/σ²),

where sgm_ι(υ) = (1 + υ^ι)^{−1} is a sigmoid scalar function, and u_{rq} = ‖x_r − x_q‖². Considering exp(), sgm_ι() as element-wise matrix operators, the respective Gram matrices can be expressed as

    K_RBF = exp(−γ D),   K_tSt = sgm_{d/2}(D),   K_Cau = sgm_1(D/σ²).    (4)

The most computationally expensive part above is the computation of the matrix D, defined as D = E − 2 X^T X, where X is the training data matrix, E = F + C, C = F^T, and the elements of F are defined as [F]_{r,q} = x_r^T x_r, ∀ r, q. As shown in [1], D can be easily partitioned into an arbitrary number of tiles, which can be computed in parallel by exploiting the GEMM function of CUDA's cuBLAS library. Furthermore, the element-wise matrix operations in (4) can also be parallelized by exploiting a multi-core CPU. Note that by appropriately reformulating other kernel functions (e.g. the inverse multiquadric kernel), an arbitrary number of different Gram matrices can be parallelized in this way.
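As a CPU-side illustration of (4), the sketch below computes the shared squared-distance matrix with a single large matrix product (the GEMM that the actual implementation tiles and offloads to the GPU via cuBLAS) and then applies the three element-wise kernel maps; the function names are ours, and the kernel forms are those reconstructed above.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Shared squared-distance matrix D = E - 2 X^T X for an L x N data
    matrix X. The single product X.T @ X is the GEMM that [1] tiles on
    the GPU; NumPy's BLAS-backed matmul plays that role in this sketch."""
    G = X.T @ X                                  # N x N inner products (GEMM)
    sq = np.diag(G)                              # [F]_rq = x_r^T x_r, per row
    D = sq[:, None] + sq[None, :] - 2.0 * G      # E = F + F^T, then D = E - 2G
    return np.maximum(D, 0.0)                    # clamp round-off negatives

def gram_matrices(X, gamma=1.0, d=2.0, sigma=1.0):
    """Gaussian RBF, t-Student and Cauchy Gram matrices of Eq. (4),
    all derived from the one shared matrix D."""
    U = pairwise_sq_dists(X)
    sgm = lambda iota, v: 1.0 / (1.0 + v ** iota)   # sgm_iota(v) = (1+v^iota)^-1
    return (np.exp(-gamma * U),       # K_RBF
            sgm(d / 2.0, U),          # K_tSt: ||x_r - x_q||^d = U^(d/2)
            sgm(1.0, U / sigma**2))   # K_Cau
```

The point of the reformulation is visible here: all three kernels share the one expensive GEMM, so adding a kernel costs only one more cheap element-wise pass.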

Classification
Let Z = {Z_c, c = 1, ..., C} be a training dataset of C subsets, where each subset Z_c = {(z_{n,c}, ω_n), n = 1, ..., N}, z_{n,c} = Ψ_c^T [k_c(x_1, x_n), ..., k_c(x_N, x_n)]^T ∈ R^D, is derived from X using AKSDA and a specified kernel function k_c(·,·). Given the above dataset, we define a linear model for each class and kernel function, f_{i,c}(z_{n,c}) = w_{i,c}^T z_{n,c}, where w_{i,c} ∈ R^D is the weight vector referring to the i-th class and the c-th kernel function. For the identification of the weight vectors we utilize the linear MSVM approach [16,6]. That is, the following optimization problem is solved using the respective training subset Z_c:

    min  (1/2) Σ_i ‖w_{i,c}‖² + R Σ_n ξ_{n,c},    (5)

subject to the constraints (w_{ω_n,c} − w_{i,c})^T z_{n,c} ≥ e_{i,n} − ξ_{n,c}, ∀ i, n, where ξ_{n,c} ≥ 0 is the slack variable corresponding to z_{n,c}, R > 0 is the penalty term, e_{i,n} = 1 − δ_{i,n}, and δ_{i,n} is the class indicator function, i.e., δ_{i,n} = 1 if ω_n = i and δ_{i,n} = 0 otherwise. Given the learned models in (5), an estimate P(i|z_{n,c}) of the posterior probability of the i-th class is derived from the decision values f_{i,c}(z_{n,c}). Then, assuming equiprobable priors for each class, the sum rule [17] can be applied to yield the overall posterior probability for the i-th class,

    Ṗ(i|x_n) ∝ Σ_{c=1}^{C} P(i|z_{n,c}),    (6)

and the following rule is used for classifying test observations:

    ω̂ = argmax_i Ṗ(i|x_n).    (7)
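The per-kernel MSVM training and the late fusion rule can be sketched as follows. This is an assumption-laden stand-in, not the paper's solver: scikit-learn's Crammer-Singer linear MSVM takes the place of the solver of (5), and raw decision values replace the calibrated posterior estimates that (6) sums in the text.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_msvms(Z_list, y, R=1.0):
    """Solve one instance of problem (5) per AKSDA-projected subset Z_c.
    R plays the role of the penalty term; Crammer-Singer is one standard
    linear MSVM formulation (cf. [16,6])."""
    return [LinearSVC(multi_class="crammer_singer", C=R).fit(Z, y)
            for Z in Z_list]

def fuse_and_classify(models, Z_list):
    """Sum-rule late fusion over the C kernels (cf. (6)), then the
    argmax decision rule of (7)."""
    scores = sum(m.decision_function(Z) for m, Z in zip(models, Z_list))
    return np.argmax(scores, axis=1)   # positions in models[0].classes_
```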

Compared methods
We experimentally compare the following methods: i) MSVM-1: the multiclass LSVM implementation of libsvm [5]. We use the Matlab version (recompiled on our evaluation workstation, so that all cores of the machine are exploited). Matlab makes use of Intel's state-of-the-art numerical libraries, which speed up this method significantly.
Datasets

ii) Amazon: It consists of 1500 reviews posted by 50 Amazon customers. Each review is represented by a 10000-dimensional vector. A random partition into 1200 training and 300 test observations is used.
iii) YouTube: It consists of 30 video game classes and up to 13 feature types per observation. We only use the audio MFCC features (2000-dimensional) and a subset of 20000 observations from the overall training set. For evaluation, the entire test set of 17177 instances is employed.

iv) News20: A collection of approximately 20000 news documents from 20 different newsgroups, used for text classification. A partition into 15935 training and 3993 test observations in R^62061 is provided, already scaled. It is a very sparse dataset, with approximately 0.13% non-zero elements.

v) Sector: A collection of 9619 corporate web pages organized into 105 categories based on each company's commercial activity. The scaled version of the dataset, partitioned into 6412 training and 3207 test observations in R^55197, is used. It is very sparse, with only 0.3% non-zero elements.

Experimental setup
For all tested methods, model selection is done using 3-fold cross-validation, where at each fold the training set is split into a 30% learning set and a 70% validation set.
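A minimal sketch of this protocol is given below, assuming the stand-in MSVM solver from the previous section; the penalty grid is a hypothetical example, as the paper does not specify one.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import LinearSVC

def select_penalty(Z, y, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """3 random folds; in each, 30% of the training set is used for
    learning and 70% for validation, as described above. A fixed
    random_state keeps the folds identical across grid values."""
    folds = ShuffleSplit(n_splits=3, train_size=0.3, random_state=0)
    best_R, best_acc = None, -np.inf
    for R in grid:
        acc = np.mean([
            LinearSVC(multi_class="crammer_singer", C=R)
            .fit(Z[learn], y[learn]).score(Z[val], y[val])
            for learn, val in folds.split(Z)
        ])
        if acc > best_acc:
            best_R, best_acc = R, acc
    return best_R
```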

Results
The retrieval evaluation results in terms of mean average precision (MAP) and the respective training/test times on the GTX 650 are shown for the YouTube and MED12 video datasets in Tables 1 and 2, respectively. It should be noted that, due to higher RAM requirements, MSVM-1 to -3 could not be run on the MED12 dataset. The classification results in terms of correct classification rate (CCR) and the respective times (on the GTX 650) for the Amazon, News20 and Sector datasets are presented in Table 3. The training and test times of AKSDA-MSVM-1 & -2 on the GTX TITAN, for all datasets, are shown in Table 4. Finally, the distribution of the training time of AKSDA-MSVM-1 along its individual components for each dataset, and the overall training time on the GTX 650 and GTX TITAN, are shown in Figures 1 and 2, respectively.

Table 1: MAP and training/test times (GTX 650) for the YouTube dataset.

    Method          MAP      Training/test time
    MSVM-1 [5]      38.94%   81.14 / 7.3
    MSVM-2 [5]      49.14%   118.7 / 7.58
    MSVM-3 [7]      28.14%   0.78 / 0.018
    AKSDA-MSVM-1    49.28%   0.57 / 0.087
    AKSDA-MSVM-2    51.6%    2.45 / 0.337

From the obtained results we observe the following: i) AKSDA-MSVM-2, followed by AKSDA-MSVM-1, achieves the best generalization performance on all datasets.
ii) In terms of training time (even on the GTX 650), AKSDA-MSVM-1 outperforms the state-of-the-art liblinear SVM implementation (MSVM-3) on three out of four datasets, while AKSDA-MSVM-2 (despite the computation of four Gram matrices) is still much faster than the MSVMs of libsvm (MSVM-1 & -2) on all datasets (Tables 1 & 3). Moreover, an impressive training time speedup (more than one order of magnitude) of the proposed method over the GPU-accelerated AGSDA-LSVM [1] is observed on the MED12 dataset (Table 2). This is mainly due to the fact that AKSDA-MSVM solves one large eigenproblem (which is done efficiently using the proposed GPU-accelerated framework) to obtain a transformation matrix Ψ_c common to all Ω classes (see Sections 2.1, 2.2), while AGSDA-LSVM computes the GPU-accelerated solution of Ω eigenproblems to obtain one transformation matrix for each class. Finally, as expected, a further improvement in training/test times is observed with the high-end GTX TITAN GPU (Table 4).
iii) From the analysis in Figure 1 we observe that the subclass partitioning, the Gram matrix computation and the linear system solver are the most computationally intensive parts of the proposed method. Moreover, we can see that the linear system solver and the Gram matrix computation are mostly affected by the number of training observations and the feature vector dimensionality, respectively.

iv) In Figure 2, it is shown that in terms of training time the proposed method scales very well both with the number of training observations N and with the feature dimensionality L. Specifically, considering the training times required for Sector and News20 when using the GTX TITAN (thus overlooking any hardware limitations of the GTX 650), we see that for a similar feature vector size, doubling the number of training samples results in a ×3 increase in training time, instead of the ×8 increase expected of conventional KDA (due to its O(N³) complexity). Additionally, comparing the results for Sector and MED12, we see that despite a +100% increase in feature vector size and a +50% increase in the number of samples, the training time increases by only +50%.

CONCLUSIONS
A GPU-accelerated multiclass learning framework was presented, providing very good performance on several multimedia-centered machine learning tasks. Future work directions include the extension of the proposed framework to multiple kernel learning [27,9,23] and the combination of AKSDA and MSVM in a single optimization problem [24].

ACKNOWLEDGEMENTS
This work was supported by the EU's FP7 and Horizon 2020 research and innovation programmes under grant agreements FP7-600826 ForgetIT and H2020-687786 InVID.