AdvRevGAN: On Reversible Universal Adversarial Attacks for Privacy Protection Applications

Different adversarial attack methods have been proposed in the literature, mainly focusing on attack efficiency and visual quality, e.g., visual similarity to the non-adversarial examples. These properties enable the use of adversarial attacks for privacy protection against automated classification systems, while maintaining utility for human users. In this paradigm, when privacy restrictions are lifted, access to the original data should be restored for all stakeholders. This paper addresses exactly this problem. Existing adversarial attack methods cannot reconstruct the original data from the adversarial ones, leading to significant storage overhead in all privacy applications. To solve this issue, we propose AdvRevGAN, a novel neural network architecture that generates reversible adversarial examples. We evaluate our approach in classification problems, examining the case where the adversarial attacks are constructed by a neural network, while the original images are reconstructed from the adversarial examples using the reverse transformation. We show that adversarial attacks using this approach maintain, and even increase, their efficiency, while the classification accuracy of the model on the reconstructed data can be almost completely restored.


INTRODUCTION
Traditional research on image privacy protection often assumes human adversaries. In other words, privacy risks are usually quantified by how effectively the information contained in images can be picked up by human eyes and brains. As a result, "blurring", "pixelation", and "mosaic" are still the most widely used techniques to protect privacy in images, even though their effectiveness against automatic analysis tools is limited [1], [2]. On the other hand, privacy protection against automatic analysis tools is gaining importance in social media settings, where human users are not considered adversaries, while automatic image crawlers might seek to collect images of specific social media users. To this end, deidentification methods based on universal adversarial attacks have been proposed to disable automatic face detection/recognition [3], as well as adversarial attack methods that guarantee the principles of k-anonymity [4, 5], while introducing the minimum possible perturbation to the original images, thus maintaining the utility of the data for human viewers [6].
Nevertheless, an important privacy protection aspect is not only to maintain the utility of the deidentified data, but also to be able to completely restore the original data upon request. To this end, the most straightforward approach is to keep a local copy of the original data. However, such a solution severely increases the storage overhead; it would therefore be far more practical to maintain only a single function that computes the privacy protection transformation. Universal adversarial perturbations could be used to this end [7, 8]; however, the actual transformation of the images is merely additive noise and, most importantly, it is the same for any given input image. Thus, a third party with access to a single original-perturbed image pair can easily uncover the perturbation. Therefore, in privacy protection applications, it is essential that this transformation be unique for each input image. To this end, transformation-based adversarial attacks have been proposed [9], where the universal perturbation is based on a linear multiplicative transformation and is thus indeed unique for each image. However, the parameters of the transformation matrix can still be approximated given a sufficient number of original-adversarial image pairs.
In this work, we extend our previous work on transformation-based universal adversarial perturbations [9] to the nonlinear case. The role of the transformation function is assigned to a deep generative neural network, which contains multiple nonlinear activation functions within its architecture. Therefore, the output perturbation is unique for each given input, whereas the parameters of the network cannot be attained by third parties. Specifically, we propose the Adversarial Reversible Generative Network (AdvRevGAN) architecture, which produces reversible adversarial examples for inputs of various sizes. In contrast with a simple transformation, where the input size directly affects the number of parameters (the transformation matrix), AdvRevGAN is able to handle different input sizes and perform well without any change in the size of the model.

Universal Adversarial Attacks
Universal Adversarial Perturbations (UAP) compute a single perturbation that generalizes over different (almost all) instances of a dataset, subject to image-level perturbation constraints. Reusing the same precomputed perturbation reduces the attack complexity, since only a single vector has to be accessed during inference. Such perturbations have also provided insights into the generalization properties of different models [7]. On the other hand, universal adversarial attacks produce noisier images compared to image-specific ones [5]. According to [7], the overall optimization objective is to find a perturbation n such that

f(x_i + n) ≠ f(x_i), with y_i = x_i + n,

where x_i ∈ R^d is a dataset sample, n is the perturbation vector and f(·) is the classifier, such that the target model misclassifies the adversarial sample y_i.
In the same vein, a variant of the UAP method, namely SGD-UAP, was first introduced in [8]. According to [10], these UAPs are created using a variant of the Projected Gradient Descent (PGD) attack. In particular, SGD has been shown to lead to better evasion rates and was therefore chosen over other methods [11]. Moreover, it exhibits better convergence than the original UAP method. In more detail, it optimizes the objective Σ_i L_f(x_i + n) over batches rather than single inputs, where L_f is the model's training loss, x_i are batches of input images, and n ∈ R^d is the learned perturbation. The gradient updates to n are computed in batches along Σ_i ∇L_f(x_i + n), i.e., ascending the training loss. It has been shown that SGD-UAP creates UAPs more effectively than the originally proposed method. In both cases, the derived perturbation vector is the same for any given input image.
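As an illustration, the batched SGD-UAP update described above can be sketched in PyTorch as follows. This is a minimal sketch, not the exact implementation of [8]: the hyper-parameter names and the L∞ projection radius `eps` are illustrative assumptions.

```python
import torch

def sgd_uap(model, loader, loss_fn, eps=0.1, lr=0.05, epochs=2):
    """Sketch of SGD-UAP: learn one perturbation n shared by all inputs.

    Gradient updates are computed over batches, ascending the summed
    training loss L_f(x_i + n); n is then projected back onto the
    L_inf ball of radius eps to keep the perturbation small.
    """
    n = None
    for _ in range(epochs):
        for x, y in loader:
            if n is None:
                # one perturbation with the shape of a single input
                n = torch.zeros_like(x[0], requires_grad=True)
            out = model(x + n)          # batched forward pass with the UAP
            loss = loss_fn(out, y)      # L_f over the batch
            loss.backward()
            with torch.no_grad():
                n += lr * n.grad.sign() # gradient ascent on the loss
                n.clamp_(-eps, eps)     # projection: keep ||n||_inf <= eps
                n.grad.zero_()
    return n.detach()
```

The same `n` is added to every image at inference time, which is what makes the attack "universal".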

Transformation-based Universal Adversarial Attacks
The adversarial attack optimization problem can also be viewed as a transformation estimation problem: find parameters ϕ such that

f(g(x_i; ϕ)) ≠ f(x_i),

where g(·; ϕ) : R^d → R^d is a transformation that maps the data samples of the clean domain X to an adversarial domain Y, and ϕ are the parameters of the transformation. It should be noted that any type of function can be employed to solve this optimization problem, i.e., g(·) could represent any linear/non-linear transformation or even an entire neural network. This formulation allows more flexibility in the definition of additional optimization constraints. For instance, the constraint of reversibility, which is very useful in privacy protection settings, can be expressed as an additional optimization constraint, i.e., g^{-1}(y) = x.

Multiplicative Universal Adversarial Attacks
The Multiplicative Universal Adversarial Transformation (MUAT) [9] is a method that builds on the transformation-based universal adversarial attack definition. It examines the simplest case, where g(x) = Tx is a linear transformation and T ∈ R^{d×d} is a matrix that stores the transformation parameters used to perturb clean samples. The original image can be recovered from the adversarial one, provided that T is invertible. While in standard additive-noise universal adversarial attacks the perturbation can be obtained by a simple subtraction on a single adversarial-clean image pair, in the multiplicative case the analogous step is to reverse-engineer the matrix T from the data, which cannot be done using just one clean-adversarial pair, since the rank of T is expected to be larger than 1.
The overall optimization function of the MUAT method is the following:

min_T Σ_i [ λ L(f(Tx_i; θ), t) + (1 − s(Tx_i, x_i)) ],

where T is the learnable transformation matrix, t ≠ f(x; θ) is a target class, 1 − s(·, ·) is an additional similarity-based loss term according to the CW-SSIM metric [12], and λ is a hyper-parameter controlling the significance of the adversarial attack term of the loss function.
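The multiplicative idea and its exact reversibility can be sketched numerically as follows. This is only an illustration of the algebra, not the MUAT training procedure: the adversarial and similarity loss terms are omitted, and the near-identity initialisation of T is an assumption made so that T is invertible and the perturbation stays small.

```python
import torch

torch.manual_seed(0)

# Flattened MNIST-sized image space: y = T x, recovered as x = T^{-1} y.
d = 28 * 28
T = torch.eye(d) + 0.01 * torch.randn(d, d)  # near-identity => invertible, mild perturbation

x = torch.rand(d)                 # flattened clean image
y = T @ x                         # multiplicative adversarial image
x_rec = torch.linalg.solve(T, y)  # exact reconstruction via T^{-1}, without forming the inverse

print((x - x_rec).abs().max())    # tiny reconstruction error
```

Note that a single (x, y) pair does not determine T, since T has d² unknowns; this is precisely the property that prevents a third party from recovering the transformation from one pair.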

Adversarial Examples with Generative Adversarial Networks
Generative Adversarial Networks (GANs), introduced by Goodfellow et al. [13], create generative models P_G that model the data distribution P_data of the training set. More specifically, GANs consist of two DNNs that are trained simultaneously: a generator network G : Z → Y and a discriminator network D. The generator G is fed with random noise z, generating an instance y_adv from a probability distribution P_G. Then, the fake instance y_adv and a real instance x are fed to the discriminator D, which tries to differentiate fake from real instances. From the classification of y_adv, the discriminator produces a label indicating whether y_adv belongs to the P_data distribution (real input) or to P_G (generated input).
In a nutshell, the generator G is trained so as to maximize the probability of the discriminator D being deceived. In this way, GANs manage to generate instances that are almost identical to the original samples. In adversarial attack settings, GANs aim at misleading a pretrained classifier f : Y → C on a given dataset, using a generator that transforms the input image. In particular, in the work of [14], the generator outputs the noise that is added to the input image, generating the adversarial attack in an efficient way. When training in a black-box attack context, the losses are based only on the input and output of the classifier, without any knowledge of its inner workings. Thus, the loss function is defined as follows:

L = α L_GAN + β L_adv^f,

where α and β control the relative importance of each objective. Note that L_GAN here is used to encourage the perturbed data to appear similar to the original data x, while L_adv^f is leveraged to generate adversarial examples, optimizing for a high attack success rate.
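As a sketch, the combined generator objective described above could be computed as follows in PyTorch. The specific term forms (a binary cross-entropy GAN term and a cross-entropy adversarial term) and the default weights are illustrative assumptions, not the precise losses of [14].

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, f_logits, target, alpha=1.0, beta=10.0):
    """Illustrative weighted sum of the two generator objectives:
    - a GAN term pushing the discriminator to label perturbed images as real,
    - an adversarial term pushing the classifier f towards the target class.
    """
    # fool the discriminator: fakes should be classified as real (label 1)
    l_gan = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # fool the classifier: minimize cross-entropy w.r.t. the target class t
    l_adv = F.cross_entropy(f_logits, target)
    return alpha * l_gan + beta * l_adv
```

In practice, `d_fake_logits` would come from the discriminator applied to the perturbed images and `f_logits` from the attacked classifier; both networks are omitted here for brevity.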

Reversible Generative Adversarial Networks
Inspired by image-to-image translation (I2I), our work considers the case where X is the clean image domain and Y is the adversarial image domain. The adversarial image domain can be obtained implicitly, by training a generator to produce adversarial examples, or explicitly, by using any adversarial attack. Then, our goal is to create an image-to-adversarial-image translation model that is approximately invertible by design. Image-to-image translation aims at transferring images from a source to a target domain while retaining content representations [15], [16]. According to [17], the goal is to find the appropriate mapping between two given domains X and Y, while minimizing the corresponding loss functions for unpaired training data. To this end, two mappings G : X → Y and G^{-1} : Y → X are learned, following the cycle-consistency principle.
In a similar fashion, we create a generator G : X → Y such that G(x_i) = y_i^adv, in order to generate adversarial examples satisfying f(y_i^adv) ≠ f(x_i) (untargeted attack). We also design an "inverse" generator G^{-1} : Y → X, a separate architecture that produces x_i^rec as an approximation of x_i. Figure 1 depicts the architecture of our model.
The forward mapping of generator G and the backward mapping of G^{-1} are each broken down into three components. X is the original image domain, Y_real^adv is the domain of real adversarial images, and Y^adv is the domain of generated adversarial images produced by G. We associate with each domain a higher-dimensional feature space, X̃ and Ỹ respectively. The mappings between the original and adversarial image spaces are individual and non-invertible. More specifically, for the real image space X, we use an encoder Enc_X : X → X̃ that extracts the image features, lifting the image into the higher-dimensional feature space, and a decoder Dec_{X_rec} : X̃ → X_rec that maps the representation back to an image space of the same dimensions as the input. We follow the same procedure for the generated adversarial image domain Y^adv, using Enc_{Y_adv} : Y^adv → Ỹ and Dec_{Y_adv} : Ỹ → Y^adv.
Between the feature spaces, we place an invertible core C such that C : X̃ → Ỹ and C^{-1} : Ỹ → X̃. As a result, the full mappings are:

G = Dec_{Y_adv} ∘ C ∘ Enc_X,  G^{-1} = Dec_{X_rec} ∘ C^{-1} ∘ Enc_{Y_adv},

where ∘ denotes function composition. For each image space, X and Y^adv, we use domain-specific discriminators D_X and D_{Y_adv}, trained with an adversarial loss. We first define a loss for the discriminator D_X to ensure that x_i and x_i^rec are close:

L_{D_X} = mse(D_X(x), 1) + mse(D_X(x'), 0),

where mse is the mean squared error and x' = G^{-1}(y^adv) ∉ X is a reconstructed (fake) sample. Similarly, through the discriminator D_{Y_adv}, we encourage the generated adversarial images to be indistinguishable from real ones, with the analogous loss function:

L_{D_{Y_adv}} = mse(D_{Y_adv}(y_real^adv), 1) + mse(D_{Y_adv}(G(x)), 0).

We define an L1 loss for the generator to ensure that x_i and x_i^rec follow the same distribution. Additionally, we introduce the cycle loss, measuring the distance between x_i and x_i^rec:

L_cycle = ∥G^{-1}(G(x_i)) − x_i∥_1.

Besides maintaining visual similarity, the generator network must also produce actual adversarial examples. For these examples, we demand that they are misclassified by the classifier, formulating a loss function that exploits some adversarial attack, e.g.,

L_adv = L_f(f(y_i^adv), t),

where L_f is a classification loss function, f(·) is a classifier and t is a target class index that can differ from the original sample label. In fact, any adversarial attack can be employed; in our experiments, the C&W [18] loss function has been employed. Furthermore, we ensure that the perturbation does not entirely alter the original image. For that reason, we define a perturbation loss, L_pert = ∥y_i^adv − x_i∥_2. Last, the two losses L_adv and L_pert complete the set of loss functions for training G and G^{-1}.
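To make the invertible core C concrete, the following is a minimal sketch of an exactly invertible block, implemented as a NICE-style additive coupling layer. This is an illustrative stand-in, not necessarily the block used in AdvRevGAN; the surrounding encoders and decoders, which are only approximately inverse, are omitted.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Exactly invertible core C between the feature spaces X~ and Y~.

    The input features are split in half; one half passes through
    unchanged and parameterises an additive shift of the other half,
    so the inverse is obtained by subtracting the same shift.
    """
    def __init__(self, dim):
        super().__init__()
        # any network can parameterise the shift; a tiny MLP suffices here
        self.net = nn.Sequential(nn.Linear(dim // 2, dim // 2), nn.Tanh())

    def forward(self, h):                      # C : X~ -> Y~
        h1, h2 = h.chunk(2, dim=-1)
        return torch.cat([h1, h2 + self.net(h1)], dim=-1)

    def inverse(self, h):                      # C^-1 : Y~ -> X~
        h1, h2 = h.chunk(2, dim=-1)
        return torch.cat([h1, h2 - self.net(h1)], dim=-1)
```

Because `h1` is unchanged by the forward pass, `inverse(forward(h))` recovers `h` exactly, regardless of how expressive `self.net` is; this is what allows the overall mapping G^{-1} ∘ G to approximate the identity up to the encoder/decoder reconstruction error.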

EXPERIMENTAL RESULTS
This section presents the experimental results of the proposed AdvRevGAN approach. As a baseline method for adversarial example generation we have employed SGD-UAP, while for the reversible attacks, MUAT is used. All methods have been implemented in Python using PyTorch. The training parameters of AdvRevGAN and MUAT are the number of epochs, the number of training samples, and the learning step for the Adam optimizer [19]. The SGD-UAP method requires additional parameters, namely the number of epochs, the upper threshold for the L_p norm of the attack, the pixel clamping value, the number of training samples, and the learning step for the SGD optimizer. As an evaluation dataset, we have employed MNIST [20], which is commonly used for evaluating adversarial attacks. Although it is a very easy dataset to classify, this is precisely what makes it challenging for adversarial attacks, since they must generate more noise in order to fool the classifier; this, in turn, makes generating the inverse image more difficult. Moreover, due to its simplicity, the number of trainable parameters remains low for both the proposed and the competing methods, making the results easily reproducible. Adversarial attacks were performed using the Carlini-Wagner L2 method [18]. The results are analyzed in terms of the Mean Squared Error (MSE) and the Structural Similarity Index Measure (SSIM), which provide insights into the quality of the generated adversarial examples and of the reconstructed images produced by the different methods. For the MNIST experiments, a LeNet-5 classifier was first trained on the training set and evaluated on the test set. The accuracy of the attacked classifier was 98.4%.
Table 1 compares the proposed method with SGD-UAP and MUAT in terms of adversarial attack generation. As can be observed, the proposed method produces less noisy perturbations than the competing methods, while remaining effective in reducing the classification accuracy. Table 2 presents the results of the proposed method in terms of reconstruction quality. The classification accuracy on the reconstructed data is restored to 90.7%, the structural similarity of the reconstructed samples with the original ones is very high, and the MSE between the reconstructed and the original data is very low. Finally, Figure 2 shows a qualitative evaluation of the competing methods. As can be observed, the proposed method produces adversarial examples that look very similar to the original data, while being able to reconstruct the original data sufficiently well.

CONCLUSIONS AND FUTURE WORK
A reversible adversarial attack method has been described that produces a mapping function which uniquely maps input images into an adversarial domain and whose inverse can almost fully reconstruct the original input. The proposed method allows the generation of untargeted adversarial examples that are also reversible, for datasets of different complexity, using generative adversarial networks (GANs). The proposed AdvRevGAN generates adversarial attacks with less noise than legacy adversarial attack methods. Last but not least, the transformation cannot be obtained by third parties, since it is non-linear and recovering it would require access to the neural network architecture and parameters.
According to recent research [21], diffusion models are a promising alternative to GANs for generating diverse and realistic samples: they use a diffusion process to iteratively transform a noise vector into a sample that matches the data distribution, and they have been shown to be more stable and easier to train than GANs. Their ability to capture complex multi-modal distributions makes them a viable alternative for generating synthetic data in scenarios where labeled data is limited or costly to obtain. Future work will consider extending the proposed architecture to also accommodate differential privacy constraints in the adversarial attack optimization problem, using more complex datasets, and to include diffusion models in our experiments.

Fig. 2. Adversarial examples and reconstructed images on the MNIST dataset. The first column depicts the original images x_i; the next three columns show the corresponding adversarial examples y_i^adv generated by the proposed method, MUAT and UAP, respectively, with the incorrect class predicted by the model shown above each example. The last two columns show the reconstructed images x_i^rec derived by MUAT and the proposed method, respectively.

Table 1. Comparison results on the MNIST dataset.