Linear Maximum Margin Classifier for Learning from Uncertain Data

In this paper, we propose a maximum margin classifier that deals with uncertainty in data input. Specifically, we reformulate the SVM framework such that each input training entity is not solely a feature vector representation, but a multi-dimensional Gaussian distribution with given probability density, i.e., with a given mean and covariance matrix. The latter expresses the uncertainty. We arrive at a convex optimization problem, which is solved in the primal form using a gradient descent approach. The resulting classifier, which we name SVM with Gaussian Sample Uncertainty (SVM-GSU), is tested on synthetic data, as well as on the problem of event detection in video using the large-scale TRECVID MED 2014 dataset, and the problem of image classification using the MNIST dataset of handwritten digits. Experimental results verify the effectiveness of the proposed classifier.


I. INTRODUCTION
Support Vector Machine (SVM) has been shown to be a powerful paradigm for pattern classification. The origins of SVM can be traced back to [1], [2]. In [3], Vapnik established the standard regularized SVM algorithm, where a linear discriminative function is computed in order to achieve the maximum sample margin. To this end, a penalty term approximating the total training error is considered along with a regularization term, typically chosen as a norm of the classifier, in order to avoid the so-called over-fitting phenomenon. From a statistical learning theory point of view, this is interpreted as follows: the regularization term restricts the complexity of the classifier, and thus the deviation of the testing error from the training error is controlled (see e.g. [4], [5], [6]). The training data are assumed to be drawn from some unknown probability distribution; specifically, they are assumed to be independently drawn and identically distributed ("iid").
The majority of classification methods do not address the uncertainty in the training data explicitly. That is, each training sample is described by its position in some vector space (feature representation). However, such an approach often does not express the true underlying process of extracting the feature representation. Errors are often introduced during sensing or feature extraction, and therefore the training data are noisy. In this work, we model the uncertainty of each training example using a multivariate Gaussian distribution, such that the covariance matrix of each distribution is treated as a measure of this uncertainty. That is, we model each input example as a random vector following a multivariate Gaussian distribution with given mean vector and covariance matrix. In Fig. 1 we can see such 2D training examples, given as bivariate Gaussian distributions with certain mean vectors and covariance matrices. For the sake of visualization, we illustrate the uncertainty of each input training vector with the shaded regions, which are bounded by the iso-density loci of points (ellipses) described by the 0.03% of the maximum density of each distribution. A novel SVM formulation is developed, by appropriately modifying the mechanism for measuring the classification (empirical) error and for taking it into account during training. Hereafter, the proposed algorithm will be called SVM with Gaussian Sample Uncertainty (SVM-GSU). The toy example in Fig. 1 illustrates the motivation behind the proposed SVM-GSU. That is, the decision boundary of SVM-GSU, shown with a solid line, may be drastically different from that of the standard SVM, shown with a dashed line, when the uncertainty associated with each input sample is taken into account.
The remainder of this paper is organized as follows. In Section II, we review related work. In Section III, we present the proposed SVM-GSU. In Section IV, we provide the experimental results of the application of SVM-GSU to synthetic data, the TRECVID MED 2014 dataset, as well as the MNIST dataset, along with comparisons with the standard SVM and other state-of-the-art methods. We discuss conclusions in Section V.

II. RELATED WORK
Modeling uncertainty in the input under the SVM paradigm is not new. Different types of robust SVMs have been proposed in several recent works. Bi and Zhang [7] considered a statistical formulation where the input noise is modeled as a hidden mixture component, but in this way the "iid" assumption for the training data is violated. In that work, the uncertainty is modeled isotropically. Second order cone programming (SOCP) [8] methods have also been employed in numerous works to handle missing and uncertain data. In addition, Robust Optimization [9], [10] techniques have been proposed for optimization problems where the data are not specified exactly, but are known to belong to a given uncertainty set U, and the constraints of the optimization problem must hold for all possible values of the data from U.
Lanckriet et al. [11] considered a binary classification problem where the mean and covariance matrix of each class are assumed to be known. Then, a minimax problem is formulated such that the worst-case (maximum) probability of misclassification of future data points is minimized. That is, under all possible choices of class-conditional densities with a given mean and covariance matrix, the worst-case probability of misclassification of new data is minimized. For doing so, the authors exploited generalized Chebyshev inequalities [12] and particularly a theorem according to which the probability of misclassifying a point is bounded.
Shivaswamy et al. [13], who extended Bhattacharyya et al. [14], also adopted a second order cone programming formulation and used generalized Chebyshev inequalities to design robust classifiers dealing with uncertain observations. The uncertainty then takes an ellipsoidal form, as follows directly from the multivariate Chebyshev inequality. This formulation achieves robustness by requiring that the ellipsoid of every uncertain data point should lie in the correct half-space. The expected error of misclassifying a sample is obtained by computing the volume of the ellipsoid that lies on the wrong side of the hyperplane. However, this quantity is not computed analytically; instead, a large number of uniformly distributed points are generated in the ellipsoid, and the fraction of points falling on the wrong side of the hyperplane over the total number of generated points is computed.
Xu et al. [15], [16] considered the robust classification problem for a class of non-box-typed uncertainty sets, in contrast to [14], [13], [11], who robustified regularized classification using box-type uncertainty. That is, they considered a setup where the joint uncertainty is the Cartesian product of uncertainty in each input, leading to penalty terms on each constraint of the resulting formulation. Furthermore, Xu et al. gave evidence on the equivalence between the standard regularized SVM and this robust optimization formulation, establishing robustness as the reason why regularized SVMs generalize well.
In [17], motivated by GEPSVM [18], Qi et al. robustified a twin support vector machine (TWSVM) [19]. Robust TWSVM deals with data affected by measurement noise using a second order cone programming formulation. In their work, the input data are assumed to be contaminated with isotropic noise (i.e., spherical disturbances centred at the training samples); consequently, the method cannot model real-world uncertainty, which is typically described by more complex noise patterns. Our proposed classifier, which is presented below, does not violate the "iid" assumption for the training input data, while it can model the uncertainty of each input training example using an arbitrary covariance matrix, consequently permitting the uncertainty to be anisotropic. Moreover, the expected error is computed analytically and is minimized by an iterative gradient descent algorithm whose complexity is linear with respect to the number of training examples. Finally, we apply a linear subspace learning approach in order to solve the problem in lower-dimensional spaces, and thus accelerate the training stage. Learning in subspaces is widely used in various statistical learning problems [20], [21], [22], [23].

III. PROPOSED APPROACH
As discussed above, in this section we develop a new algorithm in which the training set that feeds the proposed classifier includes training examples described not solely by a set of feature representations, i.e., a set of vectors x_i in some n-dimensional space, but rather by a set of multivariate Gaussian distributions; that is, every training example is characterized by a mean vector x_i ∈ D and a covariance matrix Σ_i ∈ S^n_++*. A linear formulation is proposed below, while an approximate formulation dealing with learning in linear subspaces is discussed next.

A. SVM with Gaussian Sample Uncertainty (SVM-GSU)
Let us briefly begin with the baseline SVM algorithm, which will endow us with the arguments necessary for generalizing and proceeding to the proposed approach. We consider the supervised learning framework where a set of l annotated observations is available. That is, each observation consists of a vector x_i in some n-dimensional vector space, let D ⊆ R^n†, and an associated label y_i ∈ {±1}. Let us denote the training set by X = {(x_i, y_i) : x_i ∈ R^n, y_i ∈ {±1}, i = 1, …, l}. Then, the baseline linear SVM [3] learns a hyperplane H : w · x + b = 0 that minimizes with respect to w, b the following objective function

J(w, b) = (1/2)‖w‖² + C Σ_{i=1}^{l} h(y_i, w · x_i + b),   (1)

where h(y, t) = max(0, 1 − yt) is known as the "hinge loss" function [24], and C > 0 is a regularization parameter.

* D is typically a subset of the n-dimensional Euclidean space of column vectors, while S^n_++ denotes the convex cone of all symmetric positive definite n × n matrices with real entries.
† For the rest of this paper, we will assume that D ≡ R^n.
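As a concrete illustration, the baseline objective in (1) can be evaluated in a few lines of code (a minimal sketch; the regularization parameter C and the toy 2D data are arbitrary choices, not values from the paper):

```python
import math

def hinge(y, t):
    """Hinge loss h(y, t) = max(0, 1 - y*t)."""
    return max(0.0, 1.0 - y * t)

def svm_objective(w, b, X, y, C):
    """Baseline linear SVM objective: 0.5*||w||^2 + C * sum of hinge losses."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = sum(hinge(yi, sum(wj * xj for wj, xj in zip(w, xi)) + b)
               for xi, yi in zip(X, y))
    return reg + C * loss

# Toy 2D data: two points per class, linearly separable.
X = [(1.0, 1.0), (2.0, 0.5), (-1.0, -1.0), (-2.0, -0.5)]
y = [1, 1, -1, -1]
print(svm_objective([1.0, 0.0], 0.0, X, y, C=1.0))  # -> 0.5 (all margins >= 1)
```

Here all four points attain a margin of at least one, so only the regularization term contributes.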
In this work, we assume that instead of the i-th training example we are given a multivariate Gaussian distribution with mean vector x_i and covariance matrix Σ_i. One can think of the covariance matrix Σ_i as describing the uncertainty about the position of the training sample around x_i. Formally, we define random variables X_i, each of which follows an n-dimensional Gaussian distribution with mean vector x_i ∈ R^n and covariance matrix Σ_i ∈ S^n_++, a symmetric positive definite n × n matrix. The probability density function (pdf) of the i-th Gaussian distribution is given by

f_{X_i}(x) = 1/((2π)^{n/2} |Σ_i|^{1/2}) exp(−(1/2)(x − x_i)^T Σ_i^{−1} (x − x_i)).   (2)

Adopting the above assumption for the input training vectors, we can express the training set as a set of l annotated Gaussian distributions, i.e., X = {(x_i, Σ_i, y_i) : x_i ∈ R^n, Σ_i ∈ S^n_++, y_i ∈ {±1}, i = 1, …, l}. The optimization problem, in its unconstrained form, is then formulated as follows:

min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{l} ∫_{Ω_i} (1 − y_i(w · x + b)) f_{X_i}(x) dx,   (4)

where Ω_i denotes the half-space of R^n that is defined by the hyperplane y_i(w · x + b) = 1 as Ω_i = {x ∈ R^n : y_i(w · x + b) ≤ 1}, i.e., the half-space in which a sample incurs a non-zero hinge loss. Note that the loss function L : R^n × R × R^n × S^n_++ × {±1} → R defined for the samples drawn from the i-th Gaussian, that is,

L(w, b, x_i, Σ_i, y_i) = ∫_{Ω_i} (1 − y_i(w · x + b)) f_{X_i}(x) dx,   (5)

is the expected value of the hinge loss. Using Theorem 1, proved in Appendix A, for the half-spaces Ω_i^+ = {x ∈ R^n : w · x + b − 1 ≥ 0} and Ω_i^− = {x ∈ R^n : w · x + b − 1 ≤ 0}, the above integral is evaluated in terms of w and b as follows:

L(w, b, x_i, Σ_i, y_i) = ((1 − y_i(w · x_i + b))/2) [1 + erf( (1 − y_i(w · x_i + b)) / √(2 w^T Σ_i w) )] + (√(w^T Σ_i w)/√(2π)) exp( −(1 − y_i(w · x_i + b))² / (2 w^T Σ_i w) ),   (6)

where erf : R → (−1, 1) is the error function, defined as erf(x) = (2/√π) ∫_0^x e^{−t²} dt. As stated above, the covariance matrix of each training random vector describes its uncertainty, and as the covariance matrix approaches the zero matrix, the certainty increases. At the extreme, as Σ_i → 0, the loss in (6) tends to max(0, 1 − y_i(w · x_i + b)), which is the hinge loss function used in the standard SVM formulation [3], [25], [24].
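To sanity-check the closed form in (6), one can compare it against a Monte Carlo estimate of the expected hinge loss in the one-dimensional case (a sketch; the particular values of w, b, x_i, and the variance are arbitrary):

```python
import math
import random

def expected_hinge(w, b, xi, var, yi):
    """Closed-form E[max(0, 1 - yi*(w*X + b))] for X ~ N(xi, var), 1-D case.

    d  : signed margin at the mean, d = yi*(w*xi + b)
    s2 : variance of the margin, s2 = w^2 * var  (w^T Sigma w in n dimensions)
    """
    d = yi * (w * xi + b)
    s2 = w * w * var
    s = math.sqrt(s2)
    u = (1.0 - d) / (s * math.sqrt(2.0))
    return (1.0 - d) / 2.0 * (1.0 + math.erf(u)) \
        + s / math.sqrt(2.0 * math.pi) * math.exp(-(1.0 - d) ** 2 / (2.0 * s2))

random.seed(0)
w, b, xi, var, yi = 1.5, -0.2, 0.4, 0.3, 1
mc = sum(max(0.0, 1.0 - yi * (w * random.gauss(xi, math.sqrt(var)) + b))
         for _ in range(200000)) / 200000
print(expected_hinge(w, b, xi, var, yi), mc)  # the two should agree closely
```

Shrinking the variance toward zero recovers the plain hinge loss at the mean, as stated in the text.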
This implies that the proposed formulation is a generalization of the standard SVM; the two classifiers are equivalent when the covariance matrices tend to the zero matrix‡. Let J : R^n × R → R be the objective function of the SVM-GSU formulation, i.e.,

J(w, b) = (1/2)‖w‖² + C Σ_{i=1}^{l} L(w, b, x_i, Σ_i, y_i),   (7)

which is convex, as proved in Appendix B.
To solve the convex optimization problem (4), the limited-memory BFGS (L-BFGS) algorithm has been employed§. L-BFGS belongs to the family of quasi-Newton methods and approximates the BFGS algorithm [26] using a limited amount of memory. L-BFGS requires the first-order derivatives of the objective with respect to the optimization variables w, b. The objective function is then minimized jointly for w, b and, due to convexity, a (global) optimal solution is achieved. By differentiating J with respect to w and b, we obtain, respectively,

∂J/∂w = w + C Σ_{i=1}^{l} [ −(y_i x_i/2)(1 + erf( (1 − y_i(w · x_i + b)) / √(2 w^T Σ_i w) )) + (Σ_i w / √(2π w^T Σ_i w)) exp( −(1 − y_i(w · x_i + b))² / (2 w^T Σ_i w) ) ],   (8)

∂J/∂b = −C Σ_{i=1}^{l} (y_i/2)(1 + erf( (1 − y_i(w · x_i + b)) / √(2 w^T Σ_i w) )).   (9)

By applying L-BFGS to the problem of (4), we obtain the optimal values of the parameters w, b defining the SVM-GSU's learned separating hyperplane.
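To make the optimization concrete, the sketch below minimizes the one-dimensional SVM-GSU objective using the analytic gradients; for simplicity it substitutes plain gradient descent for the L-BFGS solver used in the paper, and the toy data, learning rate, and C value are arbitrary choices:

```python
import math

def objective(w, b, data, C):
    """1-D SVM-GSU objective: 0.5*w^2 + C * sum of closed-form expected hinge losses."""
    total = 0.5 * w * w
    for xi, var, yi in data:
        d = yi * (w * xi + b)
        s2 = w * w * var
        s = math.sqrt(s2)
        u = (1.0 - d) / (s * math.sqrt(2.0))
        total += C * ((1.0 - d) / 2.0 * (1.0 + math.erf(u))
                      + s / math.sqrt(2.0 * math.pi)
                      * math.exp(-(1.0 - d) ** 2 / (2.0 * s2)))
    return total

def grad_step(w, b, data, C, lr):
    """One gradient-descent step using the analytic derivatives of the objective."""
    gw, gb = w, 0.0  # derivative of the regularization term 0.5*w^2 is w
    for xi, var, yi in data:
        d = yi * (w * xi + b)
        s2 = w * w * var
        s = math.sqrt(s2)
        u = (1.0 - d) / (s * math.sqrt(2.0))
        big_phi = 0.5 * (1.0 + math.erf(u))                 # Gaussian CDF factor
        phi = math.exp(-(1.0 - d) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi)
        gw += C * (-yi * xi * big_phi + (var * w / s) * phi)
        gb += C * (-yi * big_phi)
    return w - lr * gw, b - lr * gb

# data: (mean x_i, variance, label y_i) triples for a 1-D toy problem.
data = [(2.0, 0.5, 1), (1.5, 0.2, 1), (-2.0, 0.5, -1), (-1.0, 0.3, -1)]
w, b = 1.0, 0.5
before = objective(w, b, data, 1.0)
for _ in range(200):
    w, b = grad_step(w, b, data, 1.0, 0.05)
after = objective(w, b, data, 1.0)
print(before, after)  # the convex objective decreases monotonically here
```

Since the objective is convex, any descent method converges to the global optimum; L-BFGS simply gets there in far fewer iterations on high-dimensional problems.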
Then, given this hyperplane H : w · x + b = 0, an unseen testing datum x_t is classified to one of the two classes according to the sign of the signed distance between x_t and the separating hyperplane. That is, the predicted label of x_t is computed as y_t = sgn(d_t), where d_t = (w · x_t + b)/‖w‖, while a probabilistic degree of confidence (DoC) that the testing sample belongs to the class to which it has been classified can be calculated using the well-known sigmoid function, σ(z) = 1/(1 + e^{−z}). This is the same approach as is used in the baseline linear SVM formulation [27] for evaluating a sample's class membership at the testing phase.
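The testing-phase rule can be sketched as follows (a minimal sketch; mapping the absolute signed distance through the sigmoid is one simple way to report the confidence of the assigned class):

```python
import math

def predict(w, b, x):
    """Classify x by the sign of its signed distance to the hyperplane and
    attach a sigmoid-based degree of confidence (DoC) for the assigned class."""
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    d = (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm_w  # signed distance
    label = 1 if d >= 0 else -1
    doc = 1.0 / (1.0 + math.exp(-abs(d)))  # sigmoid of |distance|
    return label, doc

label, doc = predict([3.0, 4.0], -5.0, [3.0, 4.0])
print(label, doc)  # d = (9 + 16 - 5)/5 = 4.0, so label 1 with high confidence
```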

B. Solving the SVM-GSU in linear subspaces
Since learning in the original n-dimensional input space may introduce computationally expensive terms, in this section we propose a methodology for approximating the loss function of SVM-GSU by projecting each input random vector into a linear subspace. The dimensionality of each subspace is defined by preserving a given fraction of the total variance of each covariance matrix. Then, the total loss, as well as its first derivatives, are computed separately in each subspace. A comprehensive analysis of the above method is discussed below.

‡ A zero covariance matrix is admissible due to the well-known property that the set of symmetric positive semi-definite matrices is a convex cone with vertex at the zero matrix.
§ A framework for training and testing the linear SVM-GSU has been developed in C and is publicly available at <withheld during reviewing>.
By performing eigenanalysis on the covariance matrix of X_i, the latter is decomposed as

Σ_i = U_i Λ_i U_i^T,   (10)

where Λ_i is an n × n diagonal matrix consisting of the eigenvalues λ_i^1 ≥ … ≥ λ_i^n of Σ_i, while U_i is an n × n orthonormal matrix whose j-th column, u_i^j, is the eigenvector corresponding to the j-th eigenvalue, λ_i^j. Let us keep the first d_i ≤ n eigenvectors, such that a certain percentage p (e.g. p = 90%) of the total variance is preserved, i.e.,

Σ_{j=1}^{d_i} λ_i^j ≥ p Σ_{j=1}^{n} λ_i^j.   (11)

Then, we construct the n × d_i matrix Ũ_i by keeping the first d_i columns of U_i, i.e., Ũ_i = [u_i^1, …, u_i^{d_i}]. Now, using the matrix P_i = Ũ_i^T ∈ R^{d_i×n}, we define a new random vector Z_i = P_i X_i, which follows a d_i-dimensional Gaussian distribution with mean vector z_i = P_i x_i and covariance matrix Σ_i^z = P_i Σ_i P_i^T = diag(λ_i^1, …, λ_i^{d_i}). The probability density function of Z_i is given by

f_{Z_i}(z) = 1/((2π)^{d_i/2} |Σ_i^z|^{1/2}) exp(−(1/2)(z − z_i)^T (Σ_i^z)^{−1} (z − z_i)).   (14)

Let us now see how the integral in (4) is approximated in the new space. To this end, note that the deviation X_i − x_i concentrates, by construction of P_i, in the subspace spanned by the retained eigenvectors. Consequently, the integral in the RHS of (4) can be approximated by the quantity

∫_{Ω_i^z} (1 − y_i(w_i^z · z + b)) f_{Z_i}(z) dz,   (15)

where w_i^z = P_i w, and Ω_i^z denotes the projected half-space in R^{d_i}, that is, Ω_i^z = {z ∈ R^{d_i} : y_i(w_i^z · z + b) ≤ 1}. Using Theorem 1, which is proved in Appendix A, the above integral is equal to

L_i^z(w, b) = ((1 − y_i(w_i^z · z_i + b))/2) [1 + erf( (1 − y_i(w_i^z · z_i + b)) / √(2 (w_i^z)^T Σ_i^z w_i^z) )] + (√((w_i^z)^T Σ_i^z w_i^z)/√(2π)) exp( −(1 − y_i(w_i^z · z_i + b))² / (2 (w_i^z)^T Σ_i^z w_i^z) ),   (17)

which is the loss function computed for each training example (i.e., for each random vector that follows a given Gaussian distribution). Therefore, the objective function J : R^n × R → R, given by (7), can be approximated as follows:

J(w, b) ≈ (1/2)‖w‖² + C Σ_{i=1}^{l} L_i^z(w, b).   (18)

Following similar arguments as in the case of learning in the original space, J can be shown to be convex (see Appendix B). The first derivatives are obtained via the chain rule: since w_i^z = P_i w, the derivative of each loss term with respect to w is ∂L_i^z/∂w = P_i^T (∂L_i^z/∂w_i^z), so that

∂J/∂w = w + C Σ_{i=1}^{l} P_i^T (∂L_i^z/∂w_i^z),   (21)

while the first derivative of J with respect to b is given by

∂J/∂b = C Σ_{i=1}^{l} ∂L_i^z/∂b.   (22)

At the implementation level, for solving the SVM-GSU in linear subspaces, the eigenanalysis of the covariance matrices Σ_i is performed only once per Gaussian distribution, before the optimization procedure begins.
Consequently, the following orthonormal matrices are computed once: P_i = Ũ_i^T, i = 1, …, l, along with the projected mean vectors z_i = P_i x_i. Then, at each iteration of (w, b) ∈ R^{n+1}, and for each training example (distribution) (x_i, Σ_i), the projection of the normal to the separating hyperplane has to be computed: w_i^z = P_i w. Finally, the loss function is computed in the low-dimensional spaces R^{d_i}, i = 1, …, l, as shown in (17). The objective function is computed as shown in (18), while its first derivatives are computed as in (21), (22).
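The per-example preprocessing step above can be sketched with NumPy (a sketch; note that `numpy.linalg.eigh` returns eigenvalues in ascending order, so they are reversed to descending order before truncation):

```python
import numpy as np

def subspace_projection(Sigma, p=0.90):
    """Eigen-decompose a covariance matrix and return the projection matrix
    P (d x n) built from the leading eigenvectors that preserve a fraction p
    of the total variance, together with the d retained eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)             # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    d = int(np.searchsorted(ratios, p) + 1)              # smallest d reaching p
    P = eigvecs[:, :d].T                                 # P = U_tilde^T, shape (d, n)
    return P, eigvals[:d]

# A 3x3 covariance whose variance is dominated by one direction:
Sigma = np.diag([9.0, 0.5, 0.5])
P, lam = subspace_projection(Sigma, p=0.90)
print(P.shape, lam)  # a single direction already holds 9/10 of the variance
```

The projected covariance P Σ Pᵀ is then exactly the diagonal matrix of retained eigenvalues, as used in the loss above.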

IV. EXPERIMENTS
The classification performance of the proposed algorithm is initially validated on 2D synthetic data, in order to illustrate how the linear SVM-GSU classifier works. To this end, we consider binary classification toy experiments and validate on them the proposed learning algorithm both in the original feature space, as well as in linear subspaces.
Next, the proposed algorithm is applied to two different, challenging learning problems, i.e., the problem of complex event detection in video, and the problem of image classification of handwritten digits. The large video dataset of the TRECVID Multimedia Event Detection (MED) 2014 task is used for the event detection experiments (Sect. IV-B), while the well-known MNIST database of handwritten digits is used for the image classification ones (Sect. IV-C). For each of those problem domains, a methodology for modeling the uncertainty of each input (random) vector is also proposed.

A. Toy examples using synthetic data
In this subsection, we present two toy examples that provide insights into understanding the way the proposed algorithm works. As shown in Fig.2, two toy artificial binary classification problems are constructed. Negative samples are denoted by red × marks, while positive ones by green crosses. We assume that the uncertainty of each training example is given via a covariance matrix. In Fig.2a and 2c, the ellipses show the iso-density loci of points described by the 0.03% of the maximum density of each Gaussian distribution (please note that these ellipses are only used for visualization purposes). Moreover, in Fig.2b and 2d, the covariance matrices are approximated by low-rank matrices (rank one).
For each of the above experiments, a linear baseline SVM (LSVM) is trained using solely the centres of the distributions; i.e., ignoring the uncertainty of each sample. The resulting separating lines are shown in Fig. 2 in dashed red. Moreover, a linear SVM-GSU (LSVM-GSU) is also trained using the centres of the above distributions and the covariance matrices; i.e., using the parameters of the Gaussian distribution followed by each training example. LSVM-GSU is trained first in the original feature space (R²), and then in linear subspaces (R), preserving for each covariance matrix 90% of the total variance. The resulting separating lines virtually coincide and are shown in Fig. 2a and 2c (solid green lines). Finally, the resulting separating lines of the SVM-GSUs trained in linear subspaces using the low-rank (rank-one) covariance matrices and preserving 90% of the total variance are shown with green lines in Fig. 2b and 2d. It is evident that, when the uncertainty of the training data is taken into consideration, the decision boundaries may change drastically. Finally, the proposed algorithm manages to learn approximately the same (or a very similar) separating line, even in the cases where the optimization problem is approximated in linear subspaces, or the covariance matrices of the input vectors are low-rank.

B. Video event detection
1) Dataset and experimental setup: The large-scale video dataset of the TRECVID MED 2014 task, covering 30 target event classes, is used:
• Evaluation Set:
  – ∼50 positive samples per event class,
  – 2496 background samples (negative for all event classes).
A model vector representation scheme is adopted, similarly to [29], for representing videos. That is, a set of 346 pre-existing visual concept detectors (linear SVM classifiers that are trained on the TRECVID Semantic Indexing (SIN) 2014 dataset [29], [28]) is used for deriving a 346-element descriptor vector for each video (hereafter called "model vector"). Specifically, each input video stream is initially sampled such that a keyframe is generated every 6 seconds.
Next, each keyframe is processed as discussed above and a keyframe-level model vector is computed. Then, a video-level model vector for each video is computed by taking the average of the corresponding keyframe-level representations. Thus, the keyframe-level model vectors can be seen as different observations of the model vector which represents each video.
2) Uncertainty modeling: Let us now define a set X of l annotated random vectors representing the aforementioned video-level model vectors. Each random vector is distributed normally; i.e., for the random vector representing the i-th video, X_i ∼ N(x_i, Σ_i), i ∈ {1, …, l}. For each random vector X_i, a number N_i of observations, {x_i^t ∈ R^n : t = 1, …, N_i}, is available (these are the keyframe-level model vectors that have been computed). Then, the mean vector and the covariance matrix of X_i are computed respectively as follows:

x_i = (1/N_i) Σ_{t=1}^{N_i} x_i^t,   (23)

Σ_i = (1/(N_i − 1)) Σ_{t=1}^{N_i} (x_i^t − x_i)(x_i^t − x_i)^T.   (24)

However, the number of observations available per video in our dataset is in most cases much lower than the dimensionality of the input space; for instance, the average number of observations available for each random vector (video-level representation) is approximately 20 model vectors (keyframe-level representations), while the dimensionality of the input space is n = 346. Consequently, the covariance matrices that arise using (24) are typically low-rank; i.e., rank(Σ_i) ≤ N_i. To overcome this issue, we assume that the desired covariance matrices are diagonal. That is, we require that the covariance matrix of the i-th training sample is given by Σ̂_i = diag(σ_i^1, …, σ_i^n), such that the squared Frobenius norm of the difference Σ̂_i − Σ_i is minimized. It can easily be shown that this criterion is fulfilled when the estimated covariance matrix Σ̂_i is equal to the diagonal part of the sample covariance matrix Σ_i, i.e., when σ_i^j is the j-th diagonal element of Σ_i. We note that, using this approximation approach, the covariance matrices are diagonal but anisotropic and different for each training input example. This is in contrast with other methods (e.g. [7], [17]) that adopt more restrictive modeling approaches for the uncertainty; i.e., isotropic noise for each training sample.
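The uncertainty-modeling steps above can be sketched with NumPy (hypothetical random data standing in for the keyframe-level model vectors; N_i = 20 and n = 346 follow the figures quoted in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical keyframe-level model vectors for one video:
# N_i = 20 observations in an n = 346-dimensional space.
obs = rng.normal(size=(20, 346))

x_mean = obs.mean(axis=0)             # video-level mean vector
Sigma = np.cov(obs, rowvar=False)     # sample covariance matrix
Sigma_hat = np.diag(np.diag(Sigma))   # closest diagonal matrix in Frobenius norm

# With only 20 observations the full covariance is low-rank ...
print(np.linalg.matrix_rank(Sigma))   # at most N_i - 1 = 19
# ... while its diagonal part is full-rank (and anisotropic).
print(np.linalg.matrix_rank(Sigma_hat))
```

Taking the diagonal part is exactly the minimizer of the squared Frobenius distance over all diagonal matrices, since the off-diagonal entries contribute independently to that norm.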
3) Experimental results: Table I shows the performance of the proposed linear SVM-GSU (LSVM-GSU) in terms of average precision (AP) [30] for each target event, in comparison with the baseline linear SVM (LSVM), as well as with a linear SVM extension which handles the input uncertainty isotropically (LSVM-isotropic), as in [7], [17]. Moreover, for each dataset, the mean average precision (MAP) across all target events is reported. The optimization of the C parameter for both LSVM and LSVM-GSU is performed using a line search within a 3-fold cross-validation procedure, where at each fold the training set is split into a 70% learning set and a 30% validation set.
In Table I, column (a) shows the performance of the baseline LSVM when training is carried out using keyframe-level model vectors. That is, in this experimental scenario we attempt to resemble the case where a standard LSVM is trained using all the available observations of each training distribution, in contrast with the proposed LSVM-GSU, where training is carried out using solely the mean vectors and the covariance matrices. In column (b), we report the results of the standard LSVMs which were trained using the video-level representations; that is, using solely the mean vectors of each distribution. In contrast, by modeling the uncertainty as described in the previous section, the proposed LSVM-GSU is validated both in the case where learning is carried out in the original feature space (column (h)), and in the cases where it is approximated in linear subspaces by preserving a certain fraction (p) of the total variance of each covariance matrix. Columns (d)-(g) show the performance of LSVM-GSU when p = 0.75, 0.90, 0.95, and 0.99, respectively. The performance of the SVM extension described in [7], [17], where uncertainty is modeled isotropically (LSVM-isotropic), is given in column (c). The bold-faced numbers indicate the best result achieved for each event class. Finally, in column (i), the results of the McNemar [31], [32], [33] statistical significance test are reported. A * denotes statistically significant differences between the proposed LSVM-GSU (learning in the original space) and the baseline LSVM, while a ∼ denotes statistically significant differences between LSVM-GSU and LSVM-isotropic.
From the obtained results, we observe that the proposed algorithm (learning in the original feature space) achieved better detection performance than both LSVM and LSVM-isotropic for 22 out of the 30 event classes. The relative boost between LSVM-GSU and LSVM, achieved for each event class, is shown in column (j) of Table I, while the overall best relative performance boost (in MAP) is equal to 9.83% and is achieved when LSVM-GSU is learned in the original feature space. However, it is worth noting that a considerable boost was also achieved when LSVM-GSU was approximated in linear subspaces preserving 99% of the total variance of each covariance matrix. Furthermore, in general we observe that, as the fraction of the total variance preserved decreases, the overall detection performance also decreases.
C. Handwritten digit classification
1) Dataset and experimental setup: The proposed algorithm is also validated on the problem of image classification using the MNIST dataset of handwritten digits [34]. The MNIST dataset provides a training set of 60000 samples (approx. 6000 samples per digit) and a test set of 10000 samples (approx. 1000 samples per digit). Each sample is represented by a 28 × 28 8-bit image. Originally, MNIST does not provide any information about the uncertainty of each image; some typical examples of the original training and testing set images are shown in Fig. 3a.
In order to make the dataset more challenging, as well as to model a realistic distortion that may occur in this kind of images (scanned handwritten digits), the original MNIST dataset was "polluted" with noise. More specifically, each image example was rotated by a random angle uniformly drawn from the range [−θ, +θ], where θ is measured in degrees. Moreover, each image was translated by a random vector t uniformly drawn from [−t_p, +t_p]², where t_p is a positive integer expressing a distance measured in pixels. We created five different noisy datasets by setting θ = 15° and t_p ∈ {3, 5, 7, 9, 11}. The polluted datasets (D_1 to D_5, respectively) are shown in Table II, where D_0 denotes the original MNIST dataset. Fig. 3b and 3c show illustrative examples of the noisy datasets D_2 (θ = 15°, t_p = 5) and D_5 (θ = 15°, t_p = 11), respectively. Experiments with θ in the range [5°, 25°] gave very similar results, thus we chose to report only the results corresponding to θ = 15°.
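The pollution procedure can be sketched as follows (a partial sketch: only the random translation is applied here; a full pipeline would also rotate each image with an interpolating routine such as scipy.ndimage.rotate, omitted to keep the example dependency-light; the square blob is a stand-in for an MNIST digit):

```python
import numpy as np

rng = np.random.default_rng(0)

def translate(img, dx, dy):
    """Shift a 2-D image by (dx, dy) pixels, filling the vacated area with
    zeros (MNIST digits lie on a black background)."""
    out = np.zeros_like(img)
    h, w = img.shape
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def pollute(img, theta=15, tp=5):
    """Draw a random rotation angle from [-theta, theta] and a random integer
    translation from [-tp, tp]^2, as described in the text. Only the
    translation is applied here; the angle is returned for a rotation step."""
    angle = rng.uniform(-theta, theta)
    dx, dy = rng.integers(-tp, tp + 1, size=2)
    return translate(img, int(dx), int(dy)), angle

img = np.zeros((28, 28))
img[10:18, 10:18] = 1.0                 # a fake 8x8 "digit" blob
noisy, angle = pollute(img, theta=15, tp=5)
print(img.sum(), noisy.sum())  # mass preserved while the blob stays inside
```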
We created six different experimental scenarios using the above datasets (D_0-D_5). First, we defined the problem of discriminating the number one ("1") from the number seven ("7"), similarly to [35]. Each class in the training procedure consists of 25 samples, randomly chosen from the pools of digits one (∼6K in total) and seven (∼6K in total), while the evaluation of the trained classifier is carried out on the full testing set (∼2K samples). In each experimental scenario we report the average of 100 runs. Moreover, in each experimental scenario we compare the proposed linear SVM-GSU (LSVM-GSU) to the baseline linear SVM (LSVM), as well as to LSVM-isotropic ([7], [17]). We report the average precision (AP) [30] for each target class, and the mean average precision (MAP) across the 100 runs.
2) Uncertainty modeling: In Appendix C, we propose a methodology that, given an image, models the distribution of the images that result from a random translation of it. The methodology is based on a first-order Taylor approximation, in a way similar to that used for optical flow. Then, we can show that the image representation is distributed normally with a certain mean vector and covariance matrix, which are also evaluated there. We use this methodology for modeling the uncertainty of each training image in all the experiments below. More specifically, we assume that the translation is distributed normally as t ∼ N(µ_t, Σ_t), where µ_t = (0, 0)^T and Σ_t = diag(σ_h², σ_v²). The standard deviations of the horizontal and the vertical components of the translation, namely σ_h and σ_v, are set to

σ_h = σ_v = p_t/3,

where p_t is measured in pixels. That is, the covariance matrix is set such that each component of the translation falls in the interval [−p_t, p_t] with probability 99.7% (the three-sigma rule), so that the translation falls in the square [−p_t, p_t] × [−p_t, p_t]. For the experiments described below, this parameter is set to p_t = 5 pixels. Using the above, the mean vector and covariance matrix of the i-th image are given by (34) and (35), respectively, in Appendix C.
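The three-sigma choice σ = p_t/3 can be checked directly with the error function (a small numeric sanity check; p_t = 5 as in the experiments):

```python
import math

p_t = 5.0
sigma = p_t / 3.0  # three-sigma rule: [-p_t, p_t] spans three std deviations

# P(|T| <= p_t) for T ~ N(0, sigma^2) equals erf(p_t / (sigma * sqrt(2))).
prob_component = math.erf(p_t / (sigma * math.sqrt(2.0)))
print(prob_component)  # ~0.9973 for one translation component

# Both independent components must fall inside for the full square.
prob_square = prob_component ** 2
print(prob_square)     # ~0.9946 for the square [-p_t, p_t] x [-p_t, p_t]
```

Note that the joint probability for the square is slightly below the per-component 99.7%, since the two components must land inside simultaneously.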
3) Experimental results: Table III shows the performance of the proposed classifier (LSVM-GSU) in terms of mean average precision (MAP) for the problem of discriminating digit "1" from digit "7", for each dataset defined above (D_0-D_5). We report the average of 100 runs of each experiment. The proposed algorithm is compared both to the baseline linear SVM (LSVM), where the uncertainty of each training sample is not taken into account, and to a linear SVM extension where the uncertainty is taken into consideration isotropically (LSVM-isotropic), as in [7], [17]. The optimization of the C parameter for both LSVM and LSVM-GSU is performed using a line search within a 3-fold cross-validation procedure, where at each fold the training set is split into a 70% learning set and a 30% validation set. The performance of LSVM-GSU when the training of each classifier is carried out in the original feature space is shown in row 4, and in linear subspaces in row 5. In row 5 we report both the classification performance and, in parentheses, the fraction of variance that resulted in the best classification result.
The performance of the baseline linear SVM is shown in the second row, and the performance of the linear SVM extension handling the noise isotropically (as in [7], [17]) is shown in the third row. Moreover, Fig. 4 shows the results of the above experimental scenarios for datasets D_0-D_5. The horizontal axis of each subfigure indicates the fraction of the total variance preserved for each covariance matrix (p), while the vertical axis shows the respective performance of LSVM-GSU with learning in linear subspaces (LSVM-GSU-SL_p). Furthermore, in each subfigure, at p = 1 we also plot the result of the proposed LSVM-GSU in the original feature space (denoted with a rhombus), as well as the result of the linear SVM extension that handles the uncertainty isotropically (LSVM-isotropic) [7], [17] (denoted with a star). We report the mean and, with an error bar, the variance over the 100 iterations. The performance of the baseline LSVM is shown with a solid line, while two dashed lines show the corresponding variance of the 100 runs. From the obtained results, we observe that the proposed LSVM-GSU with learning in linear subspaces outperforms both the baseline LSVM and LSVM-isotropic for all datasets D_0-D_5. Moreover, LSVM-GSU achieves better classification results than LSVM-isotropic in 5 out of 6 datasets when learning is carried out in the original feature space. Finally, all the reported results are shown to be statistically significant using the t-test [36]; significance values (p-values) were much lower than the significance level of 1%, with most values being near 10^{-4}.
V. CONCLUSION
In this paper we proposed a novel classifier that efficiently exploits uncertainty in its input under the SVM paradigm. The proposed SVM-GSU was validated on the large-scale dataset of TRECVID MED 2014 for the problem of video event detection, as well as on the MNIST dataset of handwritten digits. For both of the above problems, a method for modeling and estimating the uncertainty of each training example was also proposed. As shown by the experiments on the video event detection and image classification problems, SVM-GSU efficiently takes into consideration the uncertainty of the training examples and achieves better detection or classification performance than the standard SVM and previous SVM extensions that model uncertainty isotropically.
TABLE III: MNIST "1" versus "7" experimental results (MAP). The proposed LSVM-GSU is compared to the baseline linear SVM (LSVM), and a linear SVM extension which handles the uncertainty isotropically (LSVM-isotropic), as in [7], [17].

APPENDIX A
ON GAUSSIAN-LIKE INTEGRALS OVER HALF-SPACES

Theorem 1. Let X ∈ R^n be a random vector that follows a multivariate Gaussian distribution with mean vector µ ∈ R^n and covariance matrix Σ ∈ S^n_++, where S^n_++ denotes the space of n × n symmetric positive definite matrices with real entries. The probability density function (pdf) of X is given by f_X : R^n → R,

f_X(x) = (2π)^(-n/2) |Σ|^(-1/2) exp( -(1/2) (x - µ)^T Σ^(-1) (x - µ) ).

Moreover, let H be the hyperplane given by a · x + b = 0. H divides the Euclidean n-dimensional space into two half-spaces (an open and a closed one), where the closed upper half-space is given by

Ω^+ = {x ∈ R^n : a · x + b ≥ 0}.

Then, the function I^+ : (R^n × R) × (R^n × S^n_++) → R, given by

I^+((a, b), (µ, Σ)) = ∫_{Ω^+} f_X(x) dx,

is equal to

I^+((a, b), (µ, Σ)) = (1/2) [ 1 + erf( (a · µ + b) / √(2 a^T Σ a) ) ],

where erf : R → (−1, 1), x ↦ (2/√π) ∫_0^x e^(−t²) dt is the so-called error function. Moreover, if the half-space is given as the lower half-space Ω^− = {x ∈ R^n : a · x + b ≤ 0}, then the function I^− : (R^n × R) × (R^n × S^n_++) → R, given by

I^−((a, b), (µ, Σ)) = ∫_{Ω^−} f_X(x) dx,

is equal to

I^−((a, b), (µ, Σ)) = (1/2) [ 1 − erf( (a · µ + b) / √(2 a^T Σ a) ) ].

Proof: We begin with the integral I^+ over Ω^+. In our approach we will need several coordinate transforms. First, we start with a translation y = x − µ in order to get rid of the mean:

I^+ = (2π)^(-n/2) |Σ|^(-1/2) ∫_{Ω_1^+} exp( -(1/2) y^T Σ^(-1) y ) dy,

where Ω_1^+ = {y ∈ R^n : a · y + a · µ + b ≥ 0}. Next, since Σ ∈ S^n_++, there exist an orthonormal matrix U and a diagonal matrix D with positive elements, i.e. the eigenvalues of Σ, such that Σ = U D U^T. Then, by letting z = U^T y and a_1 = U^T a, we have a · y = a^T y = a^T (U U^T) y = (U^T a)^T (U^T y) = a_1 · z, and

I^+ = (2π)^(-n/2) |D|^(-1/2) ∫_{Ω_2^+} exp( -(1/2) z^T D^(-1) z ) dz,

where Ω_2^+ = {z ∈ R^n : a_1 · z + a · µ + b ≥ 0}, since for the Jacobian J = U it holds that |det J| = 1. Now, in order to do rescaling, we set z = D^(1/2) v and a_2 = D^(1/2) a_1. Thus,

I^+ = (2π)^(-n/2) ∫_{Ω_3^+} exp( -(1/2) v^T v ) dv,

where Ω_3^+ = {v ∈ R^n : a_2 · v + a · µ + b ≥ 0}. Now, let B be an orthogonal matrix such that B a_2 = ‖a_2‖ e_n, which also means that a_2 = B^T ‖a_2‖ e_n. Moreover, let m = B v. Then, a_2 · v = a_2^T v = ‖a_2‖ e_n^T (B v) = ‖a_2‖ e_n^T m, and

I^+ = (2π)^(-n/2) ∫_{Ω_4^+} exp( -(1/2) m^T m ) dm,

where Ω_4^+ = {m ∈ R^n : ‖a_2‖ e_n^T m + a · µ + b ≥ 0} = R^(n−1) × [c, +∞), and c = −(a · µ + b)/‖a_2‖. The norm of a_2 can be expressed in terms of a, Σ as follows:

‖a_2‖² = a_2^T a_2 = a_1^T D a_1 = a^T U D U^T a = a^T Σ a,

and thus c = −(a · µ + b)/√(a^T Σ a). The first n − 1 coordinates integrate to one, and the remaining one-dimensional integral is easily evaluated as follows:

I^+ = (2π)^(-1/2) ∫_c^{+∞} e^(−t²/2) dt = (1/2) [ 1 − erf(c/√2) ] = (1/2) [ 1 + erf( (a · µ + b) / √(2 a^T Σ a) ) ].

Following similar arguments as above, for Ω^− = {x ∈ R^n : a · x + b ≤ 0} the domain after the same transforms becomes R^(n−1) × (−∞, c], which leads to

I^− = (1/2) [ 1 − erf( (a · µ + b) / √(2 a^T Σ a) ) ].
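Theorem 1 admits a quick numerical sanity check: the closed-form erf expression can be compared against a Monte Carlo estimate of the Gaussian mass on the half-space. The sketch below (Python/NumPy, not part of the paper; the function name is ours) illustrates this for a random 3-dimensional example.

```python
import numpy as np
from math import erf, sqrt

def gaussian_mass_upper_halfspace(a, b, mu, Sigma):
    """Closed-form mass of N(mu, Sigma) on {x : a.x + b >= 0} (Theorem 1)."""
    t = (a @ mu + b) / sqrt(2.0 * (a @ Sigma @ a))
    return 0.5 * (1.0 + erf(t))

# Monte Carlo sanity check on a random 3-D example
rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)      # symmetric positive definite
mu = rng.standard_normal(n)
a = rng.standard_normal(n)
b = 0.7

samples = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.mean(samples @ a + b >= 0.0)                    # empirical mass
cf = gaussian_mass_upper_halfspace(a, b, mu, Sigma)     # closed form
assert abs(mc - cf) < 1e-2           # agreement up to Monte Carlo error
```

Note that when a · µ + b = 0 (the hyperplane passes through the mean) the formula gives exactly 1/2, as symmetry demands.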

APPENDIX B
ON THE CONVEXITY OF THE SVM-GSU LOSS FUNCTION
Let J be the objective function of the optimization problem (4), as shown in (7). We will show that J is convex with respect to the optimization variables, w and b, over R^n × R. First, since every norm is convex and every non-negative weighted sum preserves convexity, it suffices to show that L, as shown in (5), is convex with respect to w, b for all i = 1, …, l. We first prove an auxiliary theorem, which we then use to establish the convexity of L for every i.
Theorem 2. Let f : R^n → R_+ be a non-negative, real-valued function. Then, φ : R^d → R, given by

φ(θ) = ∫_{R^n} max(0, h(θ, x)) f(x) dx,

is convex with respect to θ over R^d, if the function h(θ, x) is convex with respect to θ over R^d for every x ∈ R^n.
Proof: Let θ_1, θ_2 ∈ R^d and λ ∈ [0, 1]. Since h is convex in θ for every x, it holds that h(λθ_1 + (1−λ)θ_2, x) ≤ λ h(θ_1, x) + (1−λ) h(θ_2, x). Moreover, max(0, ·) is non-decreasing and convex, and max(0, λp) = λ max(0, p) for λ ≥ 0, p ∈ R, and thus

max(0, h(λθ_1 + (1−λ)θ_2, x)) ≤ max(0, λ h(θ_1, x) + (1−λ) h(θ_2, x)) ≤ λ max(0, h(θ_1, x)) + (1−λ) max(0, h(θ_2, x)).

Multiplying by f(x) ≥ 0 and integrating over R^n preserves the inequality, so φ(λθ_1 + (1−λ)θ_2) ≤ λ φ(θ_1) + (1−λ) φ(θ_2). Consequently, φ is convex with respect to θ over R^d. Using the result of the above theorem, by setting f(x) = f_{X_i}(x), which is a real-valued, non-negative function (as a probability density function), and h(θ, x) = 1 − y_i(w · x + b), which is affine and therefore convex with respect to θ = (w^T, b)^T over R^d ≡ R^n × R, L is proven to be convex for all i. Consequently, the objective function J is convex. That means that every local minimum of J is also a global one.
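A small numerical illustration of Theorem 2 (not part of the paper): with a fixed Monte Carlo sample standing in for the density f, the empirical objective is a non-negative weighted sum of hinge losses in (w, b), so Jensen's inequality must hold up to floating-point error for any pair of parameter vectors.

```python
import numpy as np

def phi(w, b, X, y=1):
    """Monte Carlo estimate of E[max(0, 1 - y (w.x + b))] over samples X."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w + b)))

rng = np.random.default_rng(1)
# fixed sample from a bivariate Gaussian (plays the role of f_X)
X = rng.multivariate_normal([0.5, -0.2], [[1.0, 0.3], [0.3, 0.8]], size=100_000)

# check phi(t*th1 + (1-t)*th2) <= t*phi(th1) + (1-t)*phi(th2) for random pairs
for _ in range(20):
    w1, w2 = rng.standard_normal(2), rng.standard_normal(2)
    b1, b2 = rng.standard_normal(), rng.standard_normal()
    t = rng.uniform()
    lhs = phi(t * w1 + (1 - t) * w2, t * b1 + (1 - t) * b2, X)
    rhs = t * phi(w1, b1, X) + (1 - t) * phi(w2, b2, X)
    assert lhs <= rhs + 1e-8
```

Because the same sample X is reused on both sides, the inequality is exact (up to rounding): the empirical average is itself a finite non-negative weighted sum of convex functions of (w, b).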
APPENDIX C
MODELING THE UNCERTAINTY OF AN IMAGE

Let X ∈ R^n be an r × r image, where n = r², given in row-wise form as

X = (f_1(t), f_2(t), …, f_n(t))^T,

where f_j : R² → R denotes the intensity function of the j-th pixel, after a translation by t = (h, v)^T. Fig. 5 illustrates this case of study. We will use Taylor's theorem in order to approximate the intensity function. The multivariate Taylor's theorem [37] is given below without proof.
Theorem 3 (Multivariate Taylor's Theorem). Let t = (t_1, …, t_n) ∈ R^n and consider a function f : R^n → R. Let a = (a_1, …, a_n) ∈ R^n and suppose that f is differentiable (all first partial derivatives with respect to t_1, …, t_n exist) in an open ball B around a. Then, the first-order case of Taylor's theorem states that, for t ∈ B,

f(t) = f(a) + ∇f(z) · (t − a),

for some z on the line segment joining a and t.
We will use the above theorem in order to approximate the intensity function of the j-th pixel of the given image, i.e., the function f_j. That is, around a, the intensity is approximated as f_j(t) ≈ f_j(a) + ∇f_j(a) · (t − a), by taking z to coincide with a. Consequently, by setting a = (0, 0)^T = 0, the above intensity function is approximated by f_j(t) ≈ f_j(0) + ∇f_j(0) · t.
Let us now assume that t is a random vector distributed normally with mean µ_t and covariance matrix Σ_t, i.e. t ∼ N(µ_t, Σ_t). Writing the linearized image as X = f(0) + J t, where f(0) = (f_1(0), …, f_n(0))^T and J is the n × 2 Jacobian matrix whose j-th row is ∇f_j(0)^T, X is also distributed normally with mean vector and covariance matrix that are given, respectively, by

µ = f(0) + J µ_t,    (34)

and

Σ = J Σ_t J^T.    (35)

Thus, by setting t ∼ N(µ_t, Σ_t), it holds that X ∼ N(µ, Σ), where the mean vector µ and the covariance matrix Σ are given by (34) and (35).
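The above first-order propagation can be sketched in a few lines of NumPy (an illustration of ours, not the paper's implementation): linearizing the image as X ≈ f(0) + J t, where J is the n × 2 Jacobian whose j-th row holds the spatial gradient of the j-th pixel, the mean and covariance of X follow from the standard linear transform of a Gaussian. Here we approximate the gradients with finite differences via `np.gradient`.

```python
import numpy as np

def image_uncertainty(img, mu_t, Sigma_t):
    """First-order propagation of a Gaussian translation t ~ N(mu_t, Sigma_t)
    through an r x r image: X ~ N(f(0) + J mu_t, J Sigma_t J^T)."""
    # finite-difference gradients: gy = d/dv (rows), gx = d/dh (columns)
    gy, gx = np.gradient(img.astype(float))
    # J: n x 2 Jacobian; row j holds the gradient of pixel j w.r.t. t = (h, v)
    J = np.stack([gx.ravel(), gy.ravel()], axis=1)
    mu = img.ravel().astype(float) + J @ mu_t
    Sigma = J @ Sigma_t @ J.T
    return mu, Sigma

# hypothetical 4 x 4 image with a small translation uncertainty
img = np.arange(16.0).reshape(4, 4)
mu, Sigma = image_uncertainty(img, mu_t=np.zeros(2), Sigma_t=0.25 * np.eye(2))
assert mu.shape == (16,) and Sigma.shape == (16, 16)
assert np.allclose(Sigma, Sigma.T)    # covariance is symmetric by construction
```

Note that Σ = J Σ_t J^T has rank at most 2, so in practice a small regularizer may be added to its diagonal if a strictly positive definite covariance is required.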