Exploiting One-Class Classification Optimization Objectives for Increasing Adversarial Robustness

This work examines the problem of increasing the robustness of deep neural network-based image classification systems to adversarial attacks, without changing the neural architecture or employing adversarial examples in the learning process. We attribute their well-known lack of robustness to the geometric properties of the deep neural network embedding space, derived from standard optimization choices, which allow minor changes in the intermediate activation values to trigger dramatic changes in the decision values of the final layer. To counteract this effect, we explore optimization criteria that supervise the distribution of the intermediate embedding spaces on a class-specific basis, by introducing and leveraging one-class classification objectives. The proposed learning procedure compares favorably to recently proposed training schemes for adversarial robustness in black-box adversarial attack settings.


INTRODUCTION
One of the most important drawbacks of the application of deep neural networks in sensitive image/video classification tasks is their limited robustness to adversarial attacks, i.e., they are susceptible to being fooled by carefully crafted, minor, humanly imperceptible perturbations. Adversarial attacks are methods that calculate such perturbations by exploiting the neural network backward pass to obtain gradient flow from the activations of the final (or even some intermediate) layer towards the input, using some loss function. When both the model architecture and parameters are known to the adversary, adversarial attacks are classified as white-box, while black-box/transferability attacks are devised from different host models or from the same architecture with different parameters. To date, there is a wealth of literature describing different forms of adversarial attacks in review papers [1,2], to which the reader is referred.

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement 951911 (AI4Media). This publication reflects only the authors' views. The European Commission is not responsible for any use that may be made of the information it contains.
Adversarial defenses are methods designed to counter adversarial attacks. The most prominent defenses so far are based on adversarial training [3,4], which, in simplified terms, involves training a deep neural network with adversarial examples of predefined noise margins, calculated implicitly or explicitly. Such approaches have two important disadvantages. First, they require a significant additional workload during the model training process for generating and training with adversarial examples; second, the resulting models tend to have decreased classification accuracy on clean data. On the contrary, another line of work [5,6] achieves robustness by manipulating the properties of the learned feature space, exploiting distance-based optimization criteria in the form of intermediate supervision functions. As a result, the learned representation has decreased within-class dispersion and increased between-class separation in the intermediate feature spaces, while such approaches can be used in conjunction with adversarial training for added robustness benefits.
This work builds on the latter direction and extends the recently proposed Hyperspherical Class Prototypes (HCP) method [6], by incorporating novel optimization terms inspired by the present state-of-the-art in deep neural network-based one-class classification problems [7,8,9]. The proposed method implies no modifications to the deep neural architecture and no creation of adversarial examples for training purposes. It is deployed in the form of alternative loss functions that supervise the distribution of final and intermediate layer activation values. It is shown that the proposed method increases (or at least does not hinder) the classification accuracy on clean examples, while at the same time providing increased robustness to adversarial attacks. The proposed method is evaluated in black-box/transferability-based adversarial attack settings in image classification tasks, as this scenario excludes any potential robustness induced by gradient obfuscation [10].
The rest of the paper is structured as follows. Section 2 overviews existing adversarial defenses. Section 3 analytically describes the components of the proposed method. Section 4 describes the experiments conducted in order to evaluate the effectiveness of the proposed method in image classification problems on publicly available datasets. Finally, conclusions are drawn in Section 5.

ADVERSARIAL DEFENSES
Adversarial defenses in classification systems aim to increase their ability to withstand or overcome input perturbations generated by adversarial attacks. Assuming a classification system y = f(x; θ), where f is the model decision function parametrized by θ, x is the model input and y is the model prediction, robustness is quantified by determining its tolerance to perturbations ∥p∥ < ϵ, i.e., f(x; θ) = f(x + p; θ). It should be noted that other definitions of adversarial robustness have been proposed in the past that focus on altering the classification architecture, e.g., input filtering [11] and generative methods [12]. Using the above definition of robustness, we consider such methods irrelevant to the proposed one. To date, the perturbation levels required to fool neural network classifiers with adversarial attacks are very low, i.e., perturbed images are almost indistinguishable from the original ones to the human eye.
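The pointwise robustness definition above can be made concrete with a minimal numpy sketch; the toy linear classifier below is a hypothetical stand-in for f(x; θ), not part of the paper:

```python
import numpy as np

def is_robust_at(f, x, p, eps):
    """Pointwise robustness check: f(x) == f(x + p) for a perturbation
    satisfying the budget ||p|| < eps."""
    if np.linalg.norm(p) >= eps:
        raise ValueError("perturbation exceeds the allowed budget eps")
    return f(x) == f(x + p)

# Hypothetical 2-class linear classifier standing in for f(x; theta).
w = np.array([1.0, -1.0])
f = lambda x: int(w @ x > 0)

x = np.array([2.0, 1.0])       # safely inside class 1
p = np.array([0.05, -0.05])    # ||p|| ~ 0.07 < eps
print(is_robust_at(f, x, p, eps=0.1))  # decision is unchanged here
```

Note that the same budget can flip a sample lying close to the decision boundary, which is exactly why very small ϵ suffices for adversarial attacks.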
Our work focuses on adversarial defenses that modify the training process of a neural network while maintaining the same neural network architecture, i.e., only deriving different parameters θ for f(x; θ). The straightforward approach to this end is to fine-tune or re-train the model by exploiting adversarial samples, derived by employing one or more adversarial attack methods [3,4]. This process can be applied during training by employing an additional objective function inspired by adversarial attacks. For instance, the Fast Gradient Sign [3] objectives have been employed for adversarial training in the following manner:

L_ADV = λ L_CE(f(x; θ), y) + (1 − λ) L_CE(f(x̃; θ), y), (1)

where x̃ is an adversarial sample derived from x using the Fast Gradient Sign method, L_CE is the standard cross-entropy loss function and 0 ≤ λ ≤ 1 is a hyperparameter that controls the learning balance between clean and adversarial samples (a value equal to 0.1 has been proposed, showing good results [3]). A more sophisticated variant [4] generalized the adversarial training approach by incorporating combinations of general adversarial attacks and remains, to date, the most efficient defense mechanism. The problem of adversarial robustness can also be treated from a domain adaptation point of view [13]. That is, intermediate layer clean and adversarial data representations are projected to a subspace by employing a Graph Neural Network [14], and the divergence between them is minimized by computing an approximation of the Wasserstein distance [15]. The main disadvantages of these approaches are the additional workload for calculating the adversarial examples, while at the same time, model classification accuracy on clean data is negatively affected. Moreover, due to their adversarial attack-specific nature, there is no guarantee [16] that such defenses remain effective against different types of adversarial attack.
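A minimal numpy sketch of this adversarial training objective is given below. The toy logistic model, its closed-form input gradient, and the name `adversarial_training_loss` are illustrative assumptions, not the paper's implementation (which would use a deep network and a framework backward pass):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    """Binary cross-entropy, standing in for L_CE."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm_example(x, y, w, eps):
    """Fast Gradient Sign sample: x + eps * sign(grad_x L_CE)."""
    p = sigmoid(w @ x)
    grad_x = (p - y) * w  # closed-form d BCE / d x for the logistic model
    return x + eps * np.sign(grad_x)

def adversarial_training_loss(x, y, w, eps, lam=0.1):
    """lam * clean loss + (1 - lam) * adversarial loss, as in eq. (1)."""
    x_adv = fgsm_example(x, y, w, eps)
    return lam * bce(sigmoid(w @ x), y) + (1 - lam) * bce(sigmoid(w @ x_adv), y)

w = np.array([1.0, -0.5])
x = np.array([0.3, 0.8])
y = 1.0
loss = adversarial_training_loss(x, y, w, eps=0.1)
```

Since the FGSM step moves the sample in the direction that locally increases the loss, the combined objective is at least as large as the clean loss alone.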
Ultimately, the effectiveness of adversarial defense methods that fall into the above category seems to rely on producing intermediate data representations that are as similar as possible for clean and adversarial images belonging to the same class. Recently proposed adversarial defenses [5,6] showed that incorporating distance-based optimization criteria can achieve this goal, without requiring re-training the model with adversarial examples. A second advantage of such methods is that they can employ adversarial training as a complementary step, providing increased robustness to specific adversarial attacks. Inspired by the Nearest Centroid Classifier [17] and combining ideas related to the triplet-loss [18] and center-loss [19] functions, the classification model is encouraged to produce class data representations that lie close to some learned class prototype vectors, leading to increased robustness to adversarial attacks with only minor degradation in classification accuracy for clean samples. More specifically, recently proposed adversarial defenses achieve this goal by learning class prototype vectors in the intermediate hidden layer spaces, and minimizing the distances between the class data representations and the prototype vectors. For instance, assuming g_k(x; θ) to be the k-th layer representation of some input x, and a_j the j-th class prototype vector, the Center Loss [19] criterion is optimized as follows:

L_CL = Σ_{i: y_i = j} ∥g_k(x_i; θ) − a_j∥², (2)

leading to more compact data representations for elements belonging to the j-th class. It should be noted that this specific function has some drawbacks related to the representation collapse problem, as pointed out in recent work [8,9]; that is, the loss might lead to trivial solutions after some optimization steps. To counteract such effects, modified versions of it have been proposed in one-class classification settings, e.g., early stopping criteria [7,8], as well as in adversarial robustness methods, including regularizers and contrastive loss formulations [5,6].
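The center-loss style term above amounts to a sum of squared distances between representations and their class prototypes; a minimal numpy sketch (with hypothetical toy representations and prototypes) follows:

```python
import numpy as np

def center_loss(reps, labels, centers):
    """Sum over samples of ||g_k(x_i) - a_{y_i}||^2: squared distance of
    each representation to the prototype of its own class."""
    diffs = reps - centers[labels]
    return float(np.sum(diffs ** 2))

# Hypothetical 2-D layer representations, two classes.
reps = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]])
labels = np.array([0, 0, 1])
centers = np.array([[1.0, 0.0], [-1.0, 0.0]])
print(center_loss(reps, labels, centers))  # 0 + 0.02 + 0.04 = 0.06
```

Minimizing this term pulls all same-class representations towards a single point, which is precisely where the representation-collapse concern comes from.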

ROBUST ONE-CLASS CLASSIFICATION-BASED TRAINING LOSS
The relevance of one-class classification methods to adversarial robustness stems from the fact that adversarial samples may be considered outliers with respect to the standard training data distribution. Moreover, in contrast to multi-class classifiers, one-class classifiers are not obliged to output a specific class for each input; if the input data fall outside all one-class model distributions, they are considered outliers by definition. These facts have been demonstrated in [7,20], where one-class classifiers were employed as adversarial sample detectors. This work does not employ one-class classifiers as adversarial sample detectors, but only as a vehicle to construct the robust feature learning process. The first objective of the proposed learning process is to derive tight class boundaries in the deep representation space. We adopt the HCP optimization problem [6] to this end. That is, the optimal tight class boundaries are determined by enclosing feature space class data representations with hyperspheres, and thereby minimizing the respective hypersphere volumes. This method alters the training procedure of a standard neural network architecture by training, in parallel, an additional layer that includes the prototype vector centers in the feature space. The one-class classification criteria are formally extended to the multi-class classification case. Let K be the set of layers on which the proposed objectives will be applied, where g_k(x; θ) is the k-th layer representation of some input x. The method aims to learn hyperspherical prototypes in the k-th layer, defined by the prototype matrices A^(k) ∈ R^{C×L_k}, where L_k is the dimensionality of the k-th layer, and radii R ∈ R^{|K|×C}, that will act as one-class classifiers, verifying data sample activations belonging to the j-th class. To this end, the optimization problem for each sample x_i is the following:

min Σ_{k∈K} (r_kj² + c_k Σ_i ξ_ki),
s.t. y_ij (∥g_k(x_i; θ) − a_j^(k)∥² − r_kj²) ≤ ξ_ki, ξ_ki ≥ 0, (3)

where a_j^(k) is the prototype center for class j, r_kj is the corresponding radius, y_ij = 1 if sample x_i belongs to class j and y_ij = −1 otherwise, ξ_ki are the slack variables, and c_k ≥ 0 is a hyperparameter that allows training error (i.e., a soft margin formulation), relaxing the optimization constraints. The constraints of the above optimization problem can be optimized by applying the following hinge loss function in every layer selected in K:

L_M = Σ_{k∈K} max(0, y_ij (∥g_k(x_i; θ) − a_j^(k)∥² − r_kj²)), (4)

In the deep learning case, both the feature vectors and the prototype vectors are trainable parameters, optimized through this hinge loss function. When the loss contribution of a sample x_i belonging to class j is zero, the data representation g_k(x_i; θ) falls inside the j-th class hypersphere, while otherwise, the item lies outside it. The loss value is L_M > 0 if and only if the one-class classifier decision function misclassifies x_i, and it reflects how far the data representation lies from the closest hypersphere outer boundary in the feature space. The compactness of the derived class representations is proportional to the learned value of the corresponding radius r_kj.
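The per-layer hinge term of eq. (4) can be sketched in numpy as follows; the center, radius, and sample points are hypothetical toy values, not learned parameters:

```python
import numpy as np

def hypersphere_hinge(rep, center, radius, y_pm):
    """Per-layer hinge term max(0, y * (||g - a||^2 - r^2)).
    y_pm = +1 pulls same-class points inside the hypersphere;
    y_pm = -1 pushes other-class points outside it."""
    sq_dist = float(np.sum((rep - center) ** 2))
    return max(0.0, y_pm * (sq_dist - radius ** 2))

center = np.array([0.0, 0.0])
r = 1.0
inside = np.array([0.5, 0.0])
outside = np.array([2.0, 0.0])

print(hypersphere_hinge(inside, center, r, +1))   # 0.0: already inside
print(hypersphere_hinge(outside, center, r, +1))  # 3.0: same class, outside
print(hypersphere_hinge(inside, center, r, -1))   # 0.75: other class, inside
```

The zero-loss region for correctly placed samples is exactly the property discussed next: marginal items near the boundary generate little or no gradient signal.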
The above function does not produce loss values for marginal data items, i.e., items lying close to the hypersphere boundaries. The HCP optimization procedure, as defined in [6], introduced geometrically inspired tricks to solve those issues. This work considers different optimization terms, inspired by well-established one-class classification methods. Specifically, we employ a contrastive loss term for items belonging to the same class. To this end, a mini-batch of size N is randomly sampled and the contrastive prediction task is defined on pairs of data representations derived from the mini-batch, resulting in 2N data points. For a pair of data representations (z_a, z_b) belonging to the same class j, the loss function is defined as follows:

L_P = −log ( exp(sim(z_a, z_b)/T) / Σ_{i=1, i≠a}^{2N} exp(sim(z_a, z_i)/T) ), (5)

where z_i are the remaining mini-batch representations, sim(·,·) denotes the cosine similarity, and T is the so-called temperature hyperparameter (a value of T = 0.25 was used in all our experiments). The introduction of the above loss term promotes the derivation of similar representations in the feature space, without minimizing their Euclidean distance. However, as pointed out in one-class classification tasks [9], L_P might indirectly increase the Euclidean distance, especially when the latter is very small, which contradicts adversarial robustness. Therefore, we follow the same practice and also employ an angular loss term [9] to complement the contrastive loss:

L_NP = − g_k(x_i; θ)ᵀ a_j^(k) / (∥g_k(x_i; θ)∥ ∥a_j^(k)∥). (6)

Finally, we formulate the proposed learning procedure, called Robust One-class Classification (ROCC) loss function, as the combination of the abovementioned optimization terms:

L_ROCC = L_M + L_P + L_NP, (7)

where relevant weighting hyperparameters can be considered as well, i.e., L_ROCC = µ_1 L_M + µ_2 L_P + µ_3 L_NP, for adjusting the contribution of each term to the overall loss. In our experiments, weighting parameters were not employed (i.e., µ_1 = µ_2 = µ_3 = 1), since it was found that the relevant loss terms produce values that allow smooth optimization.
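The three ROCC ingredients can be sketched together in numpy for a single sample. The hinge, NT-Xent-style contrastive, and angular terms below are toy sketches of the L_M, L_P and L_NP terms respectively, with hypothetical 2-D representations and prototypes:

```python
import numpy as np

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_pair_loss(z_a, z_b, others, T=0.25):
    """NT-Xent-style positive-pair term (sketch of L_P): attract z_a to z_b
    relative to the remaining mini-batch representations."""
    pos = np.exp(cos_sim(z_a, z_b) / T)
    denom = pos + sum(np.exp(cos_sim(z_a, z) / T) for z in others)
    return -np.log(pos / denom)

def angular_loss(rep, center):
    """Angular prototype term (sketch of L_NP): align the representation's
    direction with its class prototype, independent of Euclidean norm."""
    return -cos_sim(rep, center)

def hinge_term(rep, center, radius):
    """Same-class hypersphere hinge (sketch of L_M)."""
    return max(0.0, float(np.sum((rep - center) ** 2)) - radius ** 2)

# Hypothetical mini-batch: a same-class pair plus two other representations.
z_a, z_b = np.array([1.0, 0.1]), np.array([0.9, 0.2])
others = [np.array([-1.0, 0.5]), np.array([0.2, -1.0])]
center, r = np.array([1.0, 0.0]), 0.5

l_m = hinge_term(z_a, center, r)
l_p = contrastive_pair_loss(z_a, z_b, others)
l_np = angular_loss(z_a, center)
rocc = l_m + l_p + l_np  # mu_1 = mu_2 = mu_3 = 1, as in the experiments
```

Note that the contrastive and angular terms operate on directions only, while the hinge term constrains Euclidean distances, so the three terms do not pull against each other for a well-placed sample.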
The proposed optimization terms are employed together with the standard cross-entropy loss in the final layer of a neural network, and can additionally be applied separately in intermediate layers. Determining the optimal place to introduce the intermediate supervision constraints is an open problem; our selection is described in the experimental results. Nevertheless, it should be pointed out that a trade-off between optimal classification accuracy and adversarial robustness should be considered; i.e., the closer to the input the intermediate supervision step is employed with the proposed optimization options, the greater the adversarial robustness, while the closer to the output, the better the classification accuracy of the model.
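A minimal numpy sketch of how per-layer supervision terms combine with the output loss; the layer names, prototype dictionaries, and single hinge term per layer are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def supervised_total_loss(layer_reps, centers, radii, label, ce_loss):
    """Total objective sketch: cross-entropy at the output plus a same-class
    hypersphere hinge term at each supervised layer k in K."""
    total = ce_loss
    for k, rep in layer_reps.items():
        sq = float(np.sum((rep - centers[k][label]) ** 2))
        total += max(0.0, sq - radii[k][label] ** 2)  # y_ij = +1 case
    return total

# Hypothetical supervision at two layers for one class-0 sample.
layer_reps = {"layer3": np.array([0.2, 0.0]), "final": np.array([1.5, 0.0])}
centers = {"layer3": [np.array([0.0, 0.0])], "final": [np.array([0.0, 0.0])]}
radii = {"layer3": [1.0], "final": [1.0]}
total = supervised_total_loss(layer_reps, centers, radii, label=0, ce_loss=0.3)
```

Choosing which keys appear in `layer_reps` is exactly the placement trade-off discussed above: supervising earlier layers constrains more of the network at some cost in clean accuracy.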

EXPERIMENTS
This section describes the experiments conducted for evaluating the performance of the proposed optimization scheme. The ResNet-101 [21] architecture was employed as the baseline, as it is typically employed in image classification problems and produces close to state-of-the-art results.
In terms of datasets, we have employed the publicly available CIFAR-10, CIFAR-100 [22] and SVHN [23] datasets, which contain 10, 100 and 10 classes, respectively. The classification models were pretrained for 200 epochs using softmax only, and fine-tuned for an additional 400 epochs using the loss functions proposed by the different adversarial defense methods. Along with the proposed method (ROCC), we have also employed the Hyperspherical Class Prototype method (HCP) [6], the PCL adversarial defense [5] (PCL) and the closely related center loss function [19] (CL). Hereafter, we refer to the competing methods by their respective acronyms. The loss functions of the proposed ROCC method, HCP, PCL and CL were applied in the same ResNet layers (i.e., the 256-dimensional layer-3 and the 1024-dimensional final layer). All experiments were implemented in PyTorch 1.6.0.
In our first set of experiments, we compare the classification performance of the competing methods on the employed datasets. Since all datasets are well balanced in terms of contained classes and contain many test samples, we compare the competing methods in terms of classification accuracy. Table 1 reports the obtained classification accuracy on the respective datasets. As can be observed, the proposed method outperforms all other adversarial robustness methods in every case, while it even outperformed the vanilla softmax optimization function in two cases. This can be attributed to the fact that the proposed optimization functions only consider how to obtain better representations for each class, thus being compatible with any standard classification loss function.

In our second set of experiments, we evaluate the robustness of the competing methods to the iterative projected gradient descent (PGD) [3] attack, with a corresponding parameter ϵ = 0.1. To this end, we employed the vanilla ResNet architecture for generating adversarial samples, and inferred their labels with the respective robust models trained using the competing methods. It should be noted that this attack is the strongest form of transferability attack, since the only difference between the attack and target architecture is in the network parameters. The results are reported in Table 2. As can be observed, the proposed ROCC method outperformed the competition in the 10-class datasets (CIFAR-10, SVHN), but not in the CIFAR-100 case.

Finally, in our third set of experiments, we employed the competing architectures to attack each other, as "host" and target architectures. We again used the PGD attack with ϵ = 0.1. Here, the most robust architectures are expected to a) remain robust under transferability attacks and b) create strong adversarial samples that are able to fool the other defenses. As can be observed, the proposed ROCC method produces the strongest transferability attacks among the competition (red), while at the same time, it remains the most robust in the opposite scenario (bold).

CONCLUSION
This work described a method for increasing the robustness of deep neural network-based image classification systems to adversarial attacks, by exploiting and re-formulating one-class classification-inspired optimization criteria. Experimental results showed that the proposed optimization scheme increases adversarial robustness in black-box adversarial attacks without negative effects on classification accuracy. As this work found an interesting link between one-class classification and adversarial robustness, future work could include studying the opposite direction, i.e., adapting adversarial robustness methods for training one-class classification problems. In addition, the proposed criteria should also be studied in other forms of computer vision problems, e.g., regression-based problems such as object detection/tracking.

Table 1 :
Classification accuracy of the competing methods.

Table 2 :
Robustness (classification accuracy) under the PGD black-box attack, using the vanilla ResNet architecture as the attack model.