Layer-wise Relevance Propagation based Sample Condensation for Kernel Machines

Kernel machines are a powerful class of methods for classification and regression. Making kernel machines fast and scalable to large data, however, is still a challenging problem due to the need to store and operate on the Gram matrix. In this paper we propose a novel approach to sample condensation for kernel machines, preferably without impairing the classification performance. To the best of our knowledge, there is no previous work with the same goal reported in the literature. For this purpose we make use of the neural network interpretation of kernel machines. Explainable AI techniques, in particular the Layer-wise Relevance Propagation method, are used to measure the relevance (importance) of training samples. Given this relevance measure, a decremental strategy is proposed for sample condensation. Experimental results on three data sets show that our approach is able to achieve the goal of substantial reduction of the number of training samples.


Introduction
A fundamental result of learning theory is the family of representer theorems [7], which lead to the powerful kernel machines. Although trained to have zero classification error on the training data, kernel machines generalize well to unseen test data [4]. In contrast to deep neural networks (DNNs), they can be interpreted as two-layer neural networks. Despite this simplicity, kernel machines have turned out to be a good alternative to DNNs, capable of matching and even surpassing their performance while using fewer computational resources in training [8,9].
Making kernel machines fast and scalable to large data is still a challenging problem. A major limiting factor is the need to store all training samples, compute the corresponding Gram matrix, and solve the related system of linear equations (see Section 2). In this paper we thus consider the problem of condensing the training samples, preferably without impairing the classification performance. Based on the interpretation of kernel machines as two-layer neural networks, we make use of explainable AI techniques [15], in particular Layer-wise Relevance Propagation (LRP) [14], as a means to measure the relevance (importance) of training samples. A decremental strategy is proposed to use this measure for sample condensation.
Sample condensation has been studied in other contexts where the whole training set has to be stored and used for classification. Starting from the pioneering work [6], more advanced techniques have been proposed to boost the performance of nearest neighbor based classifiers [2,12]. In addition, nearest neighbor condensation has been applied to speed up the training of support vector machines [1] and convolutional neural networks [12].
The remainder of the paper is organized as follows. In Section 2 we introduce the fundamentals of kernel machines and discuss the need for sample condensation, thus motivating our work. Our sample condensation method is described in Section 3. Experimental results are reported in Section 4. Finally, Section 5 concludes the paper.

Kernel machines
Kernels are an efficient way to compute the similarity of two samples in a higher-dimensional space. In this section we introduce a technique to fully interpolate the training data using kernel functions, known as kernel machines. Let $X = \{x_1, x_2, \dots, x_n\} \subset \Omega^n$ be a set of $n$ training samples with their corresponding targets $Y = \{y_1, y_2, \dots, y_n\} \subset T^n$ in the target space. A function $f : \Omega \to T$ interpolates this data iff

$$f(x_i) = y_i, \quad \forall i \in \{1, \dots, n\}. \tag{1}$$

Representer Theorem [7]. Let $k : \Omega \times \Omega \to \mathbb{R}$ be a positive definite kernel, $X$ and $Y$ a set of training samples and targets as defined above, and $g : [0, \infty) \to \mathbb{R}$ a strictly monotonically increasing function for regularization. We define $E$ as an error function that calculates the loss $l$ of $f$ on the whole sample set with

$$E(X, Y) = \sum_{i=1}^{n} l(f(x_i), y_i) + g(\|f\|). \tag{2}$$

Then the function $f^*$ that minimizes $E$, $f^* = \operatorname{argmin}_f \{E(X, Y)\}$, has the form

$$f^*(z) = \sum_{i=1}^{n} \alpha_i \, k(z, x_i). \tag{3}$$

We can now use $f^*$ from Eq. (3) to interpolate our training data. Note that the only learnable parameters are $\alpha = (\alpha_1, \dots, \alpha_n)$. Learning $\alpha$ is equivalent to solving the system of linear equations

$$K \alpha = (y_1, \dots, y_n)^T, \tag{4}$$

where $K \in \mathbb{R}^{n \times n}$ is the Gram matrix with elements $K_{ij} = k(x_i, x_j)$. Since the kernel function $k$ is assumed to be positive definite, the Gram matrix $K$ is invertible. Therefore, we can find the optimal $\alpha^*$ to construct $f^*$ by

$$\alpha^* = K^{-1} (y_1, \dots, y_n)^T. \tag{5}$$

After learning, the kernel machine uses the interpolating function from Eq. (3) to make predictions for test samples. In this work we focus on classification problems. In this case the target $f(z)$ is encoded as a one-hot vector $f(z) = (f_1(z), \dots, f_t(z))$ with $t \in \mathbb{N}$ being the number of output classes. When predicting a test sample $z$, the output vector $f(z)$ is in general not a one-hot vector, and in fact not even a probability vector. The class with the highest output value is taken as the predicted class. If needed, e.g. for the purpose of classifier combination, the output vector $f(z)$ can also be converted into a probability vector by applying the softmax function.
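The training and prediction steps of Eq. (3) and Eq. (5) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the kernel is taken to be a Laplacian kernel with Euclidean distance, and a dense direct solver is used, which is practical only for small $n$.

```python
import numpy as np

def laplacian_kernel(A, B, sigma=1.0):
    """Pairwise kernel exp(-||a - b||_2 / sigma) (kernel and norm are assumptions)."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.exp(-dists / sigma)

def fit_kernel_machine(X, Y, sigma=1.0):
    """Solve K alpha = Y for alpha; Y holds one-hot targets, shape (n, t)."""
    K = laplacian_kernel(X, X, sigma)      # Gram matrix, shape (n, n)
    return np.linalg.solve(K, Y)           # alpha, shape (n, t)

def predict(Z, X, alpha, sigma=1.0):
    """Evaluate f(z) = sum_i alpha_i k(z, x_i) and pick the highest output."""
    scores = laplacian_kernel(Z, X, sigma) @ alpha
    return scores.argmax(axis=1), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 5))           # toy data, not from the paper
    labels = rng.integers(0, 3, size=30)
    alpha = fit_kernel_machine(X, np.eye(3)[labels], sigma=2.0)
    pred, _ = predict(X, X, alpha, sigma=2.0)
    print("training accuracy:", (pred == labels).mean())
```

Because the machine interpolates the training data, evaluating it on the training samples reproduces the one-hot targets up to numerical error.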
The practical usability of kernel machines strongly depends on the size $n$ of the training set. Solving for the optimal $\alpha^*$ in (5) in a naive manner requires computation of order $O(n^3)$ and is thus not feasible for many applications. Recently, a highly efficient solver, EigenPro, has been developed [13] to enable significant speedup for training on GPUs.
Sample condensation is another way of boosting efficiency, and it is required even when using high-performance solvers like EigenPro. After training, testing with (3) still needs the whole set of training samples, which is similar to the situation with nearest neighbor based classifiers. In complex domains like strings and graphs the kernel computation may be costly [3,11,18], so that the need to considerably reduce the number of samples remains. Even for easy-to-compute kernel functions, it can typically be expected that not all training samples are relevant to the classification. This observation has been made before, e.g. when working with nearest neighbor based classifiers [2,12]. Thus, there is a general need for sample condensation for kernel machines. In this work we propose a novel approach tailored to sample condensation for kernel machines. To the best of our knowledge, there is no previous work with the same goal reported in the literature.

Sample condensation method
We make use of the neural network interpretation of kernel machines and apply the Layer-wise Relevance Propagation method to measure the relevance of training samples. Given the relevance estimation of training samples, a decremental strategy is then applied to select the most relevant samples out of a training set.

LRP for relevance measure of kernel machine
The kernel machine (3) can be seen as a network with one hidden layer. Let $z$ be the test sample for which the target $f(z)$ should be computed. Given a training sample $x_i$ out of the training set $X \subset \Omega^n$, we denote by $\alpha_{it}$ the trained weight between $x_i$ and the output value $f_t(z)$. Figure 1 shows the network architecture of a kernel machine. The input $z$ is represented by a single input neuron. Each training sample is represented by a single neuron in the hidden layer and connected to the input by a special connection applying the kernel function. Each output class is represented by a neuron in the output layer, connected by the individually learned weight $\alpha_{it}$.
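Written out layer by layer, the network view amounts to the following two equations. The symbol $h_i$ for the hidden activation is introduced here only for illustration and does not appear elsewhere in the paper.

```latex
% Hidden layer: one neuron per training sample, activated by the kernel
h_i(z) = k(z, x_i), \qquad i = 1, \dots, n
% Output layer: one neuron per class t, a weighted sum of hidden activations
f_t(z) = \sum_{i=1}^{n} \alpha_{it} \, h_i(z)
```

The second equation is Eq. (3) applied to each output class separately, so the only trainable weights of the network are the $\alpha_{it}$.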
The recent research on explainable AI has spawned many techniques, e.g. for studying the influence of hyper-parameters on training deep neural networks [5] and interpreting the behavior of neural networks [15]. In particular, it is possible to estimate the relevance of features (also hidden neurons) to the network decision. We apply such relevance estimation, concretely Layer-wise Relevance Propagation (LRP) [14], to determine the relevance of training samples.
Overall, the relevance estimation in our proposed approach consists of two phases. In the first phase the optimal $\alpha^*$ in (3) is computed, by means of a highly efficient solver like EigenPro [13] if needed. In the second phase a set of validation samples is used to estimate the relevance of training samples, which builds the foundation for sample condensation described in Section 3.
We apply LRP to propagate the relevance back from the output layer to the hidden layer and so assign each training sample a relevance measure. The formula to compute the relevance of a neuron $x_i$ with connections to neurons $x_t$ on a given validation sample $z$ is given by

$$R(x_i) = \sum_{t} \frac{x_i \, w_{it}^{+}}{\sum_{j} x_j \, w_{jt}^{+}} \, R(x_t), \tag{6}$$

where $w^{+} = \max(w, 0)$. Since $x_t$ is in the output layer, its relevance is the target itself, $R(x_t) = f_t(z)$. The weight $w_{it}$ between the neurons representing $x_i$ and $x_t$ is given by the weight of the kernel machine, $w_{it} = \alpha_{it}$. The activation of the neuron representing $x_i$ is given by the kernel function $k(z, x_i)$. We can thus express the relevance $R(x_i, z)$ of a training sample $x_i$ to the output $f(z)$ on a given validation sample $z$ by

$$R(x_i, z) = \sum_{t} \frac{k(z, x_i) \, \alpha_{it}^{+}}{\sum_{j} k(z, x_j) \, \alpha_{jt}^{+}} \, f_t(z). \tag{7}$$

The target vector $f(z)$ is a one-hot vector in our case, i.e. the value in the vector corresponding to the correct target class $t^*$ is 1 and all others are 0. Therefore, we do not need to evaluate the outer sum but only the term for $t^*$, since all other terms in the sum are 0, which leads to

$$R(x_i, z) = \frac{k(z, x_i) \, \alpha_{it^*}^{+}}{\sum_{j} k(z, x_j) \, \alpha_{jt^*}^{+}}. \tag{8}$$

Eq. (8) only considers the relevance on one validation sample $z$. To get a good estimate of the general relevance of a training sample $x_i$, we split the training set $X \subset \Omega^n$ into two disjoint subsets $X_{\text{train}}$ and $X_{\text{val}}$ with $X = X_{\text{train}} \cup X_{\text{val}}$ and $X_{\text{train}} \cap X_{\text{val}} = \emptyset$. We train a kernel machine using only the set $X_{\text{train}}$. For each training sample $x_i \in X_{\text{train}}$ we then sum the relevances over all validation samples:

$$R(x_i) = \sum_{z \in X_{\text{val}}} R(x_i, z). \tag{9}$$
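The relevance of Eq. (8), summed over the validation set, can be sketched as a vectorized NumPy routine. The function name, the array layout, and the small constant `eps` guarding against a zero denominator are assumptions made for this illustration; the kernel values are assumed to be non-negative.

```python
import numpy as np

def lrp_relevance(K_val, alpha, val_labels, eps=1e-12):
    """Summed relevance of each training sample over all validation samples.

    K_val:      (n_val, n_train) kernel values k(z, x_i)
    alpha:      (n_train, t) trained weights alpha_it
    val_labels: (n_val,) integer index of the correct class t* per sample
    """
    alpha_pos = np.maximum(alpha, 0.0)            # w^+ = max(w, 0)
    # Row v holds alpha^+_{i t*} for the correct class of validation sample v.
    weights = alpha_pos[:, val_labels].T          # (n_val, n_train)
    contrib = K_val * weights                     # numerator of Eq. (8)
    denom = contrib.sum(axis=1, keepdims=True)    # denominator of Eq. (8)
    per_sample = contrib / np.maximum(denom, eps) # eps: assumed safeguard
    return per_sample.sum(axis=0)                 # sum over validation samples

if __name__ == "__main__":
    # Toy numbers chosen by hand, not taken from the paper.
    K_val = np.array([[1.0, 0.5, 0.25],
                      [0.2, 0.4, 0.4]])
    alpha = np.array([[1.0, -1.0],
                      [2.0, 0.5],
                      [-3.0, 1.0]])
    print(lrp_relevance(K_val, alpha, np.array([0, 1])))
```

For each validation sample with a positive denominator the relevances of all training samples sum to 1, so the summed score can be read as the number of validation samples a training sample "accounts for".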

Relevance-based sample condensation
Given the relevance estimation of training samples, we first sort the training samples by the relevance measure. A decremental strategy is then applied to select the most relevant samples out of a training set $X_{\text{train}}$ by gradually eliminating the least relevant samples until only $m$ ($< n$) samples are left.
The idea is to select the $m$ samples with the highest relevance scores. A difficulty with this simple approach is the proper choice of the parameter $m$. If $m$ is chosen too small, the selected samples will not suffice to reach a model of good accuracy. On the other hand, if $m$ is chosen too large, the selection is sub-optimal since the same accuracy could be reached with fewer samples. Therefore, we define a parameter $\mu$ that expresses the minimum share of the original score (on the whole training set) that we would like to retain. This means that for the whole training set $X_{\text{train}}$, the selected samples $X_{\text{selected}_m} \subset X_{\text{train}}$, and a score measure $s$, e.g. the accuracy, the following should hold:

$$s(X_{\text{selected}_m}) \geq \mu \cdot s(X_{\text{train}}). \tag{10}$$

We assume that the evolution of the score is approximately monotonically increasing, i.e. the score is in general higher for greater $m$ but may show local noise that is small enough to be ignored. Later in Section 4 we will show that the score indeed has this form. A possible way to find $m$ samples that represent the data best is to drop the $\Delta$ least relevant samples in each step. We train a new kernel machine in each step with the remaining samples and re-calculate the relevances with this machine. In general, we hope that in each step there is less redundancy. For example, a sample of medium relevance can become more relevant in later iterations, when other samples that are similar to it have been dropped. In each iteration, we thus train a kernel machine and drop the $\Delta$ least relevant samples with respect to the validation set. The algorithm is depicted in Algorithm 1.

[Algorithm 1: listing not recoverable apart from its final steps, "return $X_{\text{train}}$, $Y_{\text{train}}$, score$_i$" and "end procedure".]

Since a new kernel machine is computed in each iteration and its weight vector is used in the next iteration, the runtime, counted in training iterations, is always of order $O(n_{\text{train}} / \Delta)$. Another way to find $m$ samples that represent the data best is to start at the other end of the set, i.e. to add the $\Delta$ most relevant samples in each step.
This incremental strategy, however, turns out not to be competitive with the decremental strategy [17] and is thus not further discussed in this paper.
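The decremental strategy described above can be sketched as follows. This is not the paper's Algorithm 1 but a minimal reading of the prose under stated assumptions: the score $s$ is taken to be the accuracy on the validation set, the kernel is a Laplacian kernel with Euclidean distance, the kernel machine is trained with a dense direct solver, and all function names are hypothetical.

```python
import numpy as np

def kernel(A, B, sigma):
    """Laplacian kernel exp(-||a - b||_2 / sigma) (assumed kernel and norm)."""
    return np.exp(-np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2) / sigma)

def train_and_score(X_tr, Y_tr, X_val, y_val, sigma):
    """Train a kernel machine and return weights, validation kernel, accuracy."""
    alpha = np.linalg.solve(kernel(X_tr, X_tr, sigma), Y_tr)
    K_val = kernel(X_val, X_tr, sigma)
    acc = float(((K_val @ alpha).argmax(axis=1) == y_val).mean())
    return alpha, K_val, acc

def relevance(K_val, alpha, y_val, eps=1e-12):
    """Relevance per training sample, summed over validation samples."""
    contrib = K_val * np.maximum(alpha, 0.0)[:, y_val].T
    denom = np.maximum(contrib.sum(axis=1, keepdims=True), eps)
    return (contrib / denom).sum(axis=0)

def condense(X_tr, Y_tr, X_val, y_val, mu=0.99, delta=100, sigma=7.0):
    """Drop the delta least relevant samples per step while the score stays
    at or above mu times the score of the full training set."""
    alpha, K_val, base = train_and_score(X_tr, Y_tr, X_val, y_val, sigma)
    keep = np.arange(len(X_tr))                    # indices still in the set
    while len(keep) > delta:
        order = np.argsort(relevance(K_val, alpha, y_val))  # ascending
        candidate = keep[np.sort(order[delta:])]   # remove delta least relevant
        alpha_c, K_c, acc = train_and_score(
            X_tr[candidate], Y_tr[candidate], X_val, y_val, sigma)
        if acc < mu * base:                        # stop criterion violated
            break
        keep, alpha, K_val = candidate, alpha_c, K_c
    return keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = np.repeat([0, 1], 40)                      # toy two-class data
    X = rng.normal(size=(80, 2)) + 4.0 * y[:, None]
    idx = rng.permutation(80)
    tr, va = idx[:60], idx[60:]
    keep = condense(X[tr], np.eye(2)[y[tr]], X[va], y[va],
                    mu=0.9, delta=10, sigma=2.0)
    print("samples kept:", len(keep))
```

The relevances are recomputed with the newly trained machine in every iteration, which is what allows a sample of medium relevance to gain relevance once similar samples have been removed.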

Data sets
For our purposes, we have chosen three data sets for image classification that are broadly used and well studied. The MNIST data set contains 60,000 handwritten digits (gray-level images of 28 × 28 pixels) for training and 10,000 handwritten digits for testing, written by 250 different people. The MNIST-Fashion data set has the same structure as the original MNIST data set (i.e. 60,000 training images and 10,000 test images, all of size 28 × 28). It contains images of clothes from 10 different classes (t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, ankle boot). The CIFAR-10 data set is formed by selecting and labeling suitable images out of the 80 million tiny images data set. It contains 50,000 training images plus 10,000 designated test images, each being a 32 × 32 RGB image labeled with one of ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). The objects in the images were captured from different viewpoints and from different distances, which leads to more variety than in the other two data sets. For all three data sets, 90% of the training data is actually used for training, while the remaining 10% serves as validation data for relevance estimation.
The state-of-the-art classification results on these data sets can be found in [10,16,20], respectively. It is important to emphasize that it is not our goal to beat these results. Instead, we use the data sets to study the ability of our approach to condense samples without impairing the classification performance of kernel machines. The power of kernel machines themselves as classifiers has already been demonstrated in the literature [4,9].

Results
Since convolutional neural networks (CNNs) are powerful in feature learning, we train a CNN with all the samples and use the learned features from the convolutional layers as input to a kernel machine. In all our experiments we use the Laplacian kernel $k(x_1, x_2) = \exp(-\|x_1 - x_2\| / \sigma)$ with a bandwidth $\sigma = 7$. We chose this bandwidth since the experiments show only minor improvements with larger bandwidths. In Figure 2 we show the performance of our approach on the three test sets (the step size is set to $\Delta = 100$). For comparison, we also show the performance of the same number of randomly selected samples. We marked the original accuracy obtained with the whole training set, as well as $\mu = 99\%$ and $\mu = 98\%$ of this accuracy, which serve as the stop criterion described in the algorithm. Note that for the MNIST set we instead chose $\mu = 99.9\%$ and $\mu = 99.8\%$, since the accuracy for this set was still high even with only 2,500 of 60,000 samples left. The required number of samples to reach a certain accuracy is shown in Table 1.
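Reading the required number of samples off an accuracy curve for a given $\mu$ can be sketched as below. The helper name is hypothetical and the curve values are invented purely for illustration; they are not results from Figure 2 or Table 1. The helper walks from the largest sample count downwards and stops at the first violation of the stop criterion, mirroring the decremental procedure.

```python
def smallest_sufficient_m(curve, base_score, mu):
    """Return the smallest m reached before the score first drops below
    mu * base_score, scanning from the largest m downwards.

    curve: iterable of (m, score) pairs; returns None if even the largest
    m violates the criterion.
    """
    best = None
    for m, score in sorted(curve, reverse=True):
        if score < mu * base_score:
            break                      # stop criterion violated
        best = m
    return best

if __name__ == "__main__":
    # Invented toy curve with a small local dip at m = 4000.
    curve = [(1000, 0.80), (2000, 0.90), (3000, 0.985),
             (4000, 0.97), (5000, 0.99), (6000, 1.00)]
    print(smallest_sufficient_m(curve, base_score=1.0, mu=0.98))
    print(smallest_sufficient_m(curve, base_score=1.0, mu=0.95))
```

With local noise in the curve, stopping at the first violation can return a larger $m$ than a global search would, which is why the assumption of an approximately monotonic score matters.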
When decreasing the number of samples (i.e. reading the figures from right to left), the accuracy of the randomly selected samples decreases considerably, while the accuracy of the samples selected with our approach decreases only slowly (CIFAR-10) or stays constant (MNIST and MNIST-Fashion). For MNIST-Fashion, reducing the samples even increases the accuracy slightly.
To produce the data for the decremental approach in Figure 2 we reduced the training set down to 2,000 samples. This took 7.9h for the two MNIST data sets and 11.6h for CIFAR-10, because the feature vectors contain more elements there. Note that the runtime was recorded on a computer with 16 GB memory and an Nvidia GTX 760. As an example, Figure 3 shows the accumulated runtime as a function of the number of samples left on the MNIST data set.
We now investigate whether the selected samples have a higher expressiveness in general or whether it is limited to our specific experiment with kernel machines. To this end, we use only the selected samples to train the original CNN and compare the accuracy of the resulting model to that of a model trained on the same number of randomly selected samples. Table 2 shows the result of this comparison for the samples selected with our approach. We can see that the selected samples do not seem to have a higher expressiveness in general. Only for the MNIST data set, where we managed to select very few samples, is the accuracy of the selected samples higher. On the other data sets, the accuracy is mostly the same or slightly worse.
Overall, our LRP-based sample condensation technique can reduce the number of training samples while still preserving the high accuracy of kernel machines. As input for these studies we have chosen the output of the convolutional (feature learning) part of a CNN, which was trained once with the whole data set. In the comparison shown in Table 2, the CNN, especially its convolutional part, was trained only with the remaining, selected samples of the previous experiments. Since we could not show that the selected samples lead to greater accuracy than randomly selected ones, we come to the following conclusion: the convolutional (feature learning) part of a CNN really benefits from a large base of training samples, whereas for the fully connected (classification) part a smaller base of training samples is sufficient.
It is important to mention that achieving a generally higher expressiveness of the selected training samples is not the goal of this work. In fact, it cannot necessarily be expected since our approach is tailored to kernel machines. Our goal is to reduce the number of training samples to store for model inference and classification of unseen patterns, which is clearly achieved. We could reduce both the size of the model and the complexity of computation for kernel machines. To maintain 99% (99.9% for MNIST) of the original accuracy, we could reduce the number of training samples to 5% of the original training set for an easier task like MNIST and to 43% for a more complex task like CIFAR-10.

Conclusion
This work aims at a substantial reduction of the training data that must be stored for model inference and classification of unseen patterns with kernel machines. Based on the neural network interpretation of kernel machines, we apply explainable AI techniques, in particular the Layer-wise Relevance Propagation method, to measure the relevance (importance) of training samples. A decremental strategy has been proposed for sample condensation. Our experimental results demonstrated the ability of our approach to considerably condense the training set without impairing the classification performance. Currently, we apply a rather straightforward decremental strategy for the condensation purpose. More sophisticated techniques can be studied in the future. For instance, the concept of sparse representations [19] may be an option to model the importance of training samples.
To the best of our knowledge, our work is the first contribution to sample condensation tailored to kernel machines. As such it contributes to making kernel machines fast and scalable to large data. In addition, it also represents a novel application of explainable AI techniques.