Published January 25, 2025 | Version v3
Publication Open

Data Duplication: A Novel Multi-Purpose Attack Paradigm in Machine Unlearning

  • 1. University of Technology, Sydney
  • 2. City University of Macau
  • 3. Union Theological Seminary
  • 4. Griffith University
  • 5. Helmholtz Center for Information Security

Description

The project contains all the Python code necessary for our experiments. To keep the structure clear, the functions are divided into six components, each of which is explained in detail below, step by step. Please note that the datasets used in our experiments (CIFAR-10, SVHN, MNIST, and Facescrub) are derived from a third-party library, torchvision, so we do not provide the original data.
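For reference, datasets such as CIFAR-10, SVHN, and MNIST can be fetched directly through torchvision; the root path and transform below are placeholder choices, not the project's exact settings.

```python
# Minimal sketch of obtaining the third-party datasets via torchvision.
# Root directory and transform are illustrative assumptions.
import torchvision
import torchvision.transforms as transforms

transform = transforms.ToTensor()
cifar10 = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
svhn = torchvision.datasets.SVHN(root="./data", split="train", download=True, transform=transform)
mnist = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=transform)
```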
In the root directory of the project, we provide the customized model structures in model_definition.py, the training and testing functions shared across almost all experiments in train_function.py, and some utility functions for data loading and processing in myutils.py.
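As an illustration of the kind of helper expected in myutils.py (the actual function names in the repository may differ), a dataset can be split into retained and forgotten subsets given the forgotten-sample indices:

```python
# Hypothetical helper: split a dataset into retained and forgotten subsets
# given a list of forgotten-sample indices. Names are placeholders.
from torch.utils.data import Subset

def split_retain_forget(dataset, forget_indices):
    forget_set = set(forget_indices)
    retain_indices = [i for i in range(len(dataset)) if i not in forget_set]
    return Subset(dataset, retain_indices), Subset(dataset, list(forget_indices))
```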
First, some samples in the original dataset are randomly designated as forgotten data, and their indices are stored in /autoencoder/sample_index_list. To generate samples with features similar to these forgotten samples, an autoencoder is trained with the target model serving as the feature extractor. Run train_AE.py first, then run generate_samples.py to generate new samples from the saved autoencoder weights. We provide some trained weights in /autoencoder/models and some generated examples in /autoencoder/examples.
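The sketch below illustrates the idea behind train_AE.py under the assumption of 3x32x32 inputs: the autoencoder reconstructs the forgotten samples while its outputs are pushed to match the target model's features. Class names, layer sizes, and the loss weighting are placeholders, not the repository's exact configuration.

```python
# Hedged sketch: autoencoder trained with a frozen target model as feature extractor.
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder/decoder sized for 3x32x32 inputs (e.g. CIFAR-10, SVHN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def ae_loss(ae, target_model, x, alpha=1.0):
    """Pixel reconstruction loss plus feature similarity under the frozen target model."""
    recon = ae(x)
    with torch.no_grad():
        feat_real = target_model(x)   # target model acts as the feature extractor
    feat_fake = target_model(recon)
    return nn.functional.mse_loss(recon, x) + alpha * nn.functional.mse_loss(feat_fake, feat_real)
```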
Code for all the unlearning baselines is included in /unlearning. To implement Retraining, Random Labelling (RL), and Gradient Ascent (GA), use the train() function in RL_and_GA.py and adjust its arguments. Specifically, as described in Step 1, train() can be used for standard training/retraining, and it can also serve as an unlearning function by setting the argument RL or GA to True. The implementation of Fisher Forgetting (FF) is located in test_fisher.py. Please note that you need to carefully segment the dataset according to the specific unlearning request (original data, test data, retained data, forgotten data, etc.), because our paper considers an extensive range of experimental settings.
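A minimal sketch of how the RL and GA switches might alter a single training step is shown below; the real train() signature, hyperparameters, and dataset segmentation are defined in RL_and_GA.py.

```python
# Hedged sketch of the RL/GA behaviour on one batch of forgotten data.
import torch
import torch.nn as nn

def unlearn_step(model, opt, x_forget, y_forget, num_classes, RL=False, GA=False):
    criterion = nn.CrossEntropyLoss()
    opt.zero_grad()
    out = model(x_forget)
    if RL:
        # Random Labelling: push the forgotten samples toward random labels.
        y_rand = torch.randint(0, num_classes, y_forget.shape, device=y_forget.device)
        loss = criterion(out, y_rand)
    elif GA:
        # Gradient Ascent: maximise the loss on the forgotten samples.
        loss = -criterion(out, y_forget)
    else:
        # Standard training / retraining on the given batch.
        loss = criterion(out, y_forget)
    loss.backward()
    opt.step()
    return loss.item()
```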
Backdoor attacks are used as a metric to assess, from a privacy perspective, how much knowledge of the forgotten data remains in unlearned models. The implementation of the backdoor attack can be found in /backdoor; the triggers we used are also included.
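The sketch below shows the general shape of such a backdoor evaluation: stamp a trigger patch onto inputs and measure how often the model predicts the attacker's target label. The trigger size, position, and target label here are placeholders, not the repository's actual triggers.

```python
# Hedged sketch of a backdoor evaluation with a small trigger patch.
import torch

def apply_trigger(x, trigger, pos=(0, 0)):
    """Overlay a trigger patch on a batch of images shaped (B, C, H, W)."""
    x = x.clone()
    h, w = trigger.shape[-2:]
    r, c = pos
    x[:, :, r:r + h, c:c + w] = trigger
    return x

@torch.no_grad()
def attack_success_rate(model, loader, trigger, target_label, device="cpu"):
    model.to(device).eval()
    hit = total = 0
    for x, _ in loader:
        x = apply_trigger(x.to(device), trigger.to(device))
        pred = model(x).argmax(dim=1)
        hit += (pred == target_label).sum().item()
        total += x.size(0)
    return hit / total
```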
We simulate the de-duplication methods described in our paper by calculating the feature similarity between two groups of samples; the function can be found in /deduplication/simulate_deduplicate.py. To show the similarity between the two groups directly, we also provide MSE statistics functions in statistic.py.
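As a rough sketch (the threshold and the choice of feature extractor are assumptions; see simulate_deduplicate.py for the actual procedure), de-duplication by feature similarity can be simulated as follows:

```python
# Hedged sketch: drop a candidate sample if its feature-space MSE to any
# original sample falls below a threshold. Inputs are assumed to be tensors.
import torch

@torch.no_grad()
def simulate_deduplicate(feature_extractor, originals, candidates, threshold=0.05):
    f_orig = feature_extractor(originals)    # (N, d) features of original samples
    f_cand = feature_extractor(candidates)   # (M, d) features of candidate samples
    # Pairwise MSE between every candidate and every original sample.
    mse = ((f_cand.unsqueeze(1) - f_orig.unsqueeze(0)) ** 2).mean(dim=-1)  # (M, N)
    min_mse, _ = mse.min(dim=1)
    keep_mask = min_mse > threshold          # candidates below the threshold count as duplicates
    return candidates[keep_mask], min_mse
```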
Code for federated learning is in the /FL directory. To reproduce the results, run each file in a Jupyter notebook. This part mainly consists of federated-learning methods, covering four metrics: Model Fidelity, Test Performance, Unlearning Efficacy, and Unlearning Impact.
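For orientation only, a standard FedAvg-style aggregation step (not necessarily the exact scheme used in the /FL notebooks) looks like this:

```python
# FedAvg-style aggregation: average client weights into the global model,
# optionally weighted by client dataset sizes. Illustrative only.
import copy
import torch

def fedavg(global_model, client_models, client_sizes=None):
    n = len(client_models)
    if client_sizes is None:
        client_sizes = [1.0] * n
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_models[0].state_dict())
    for key in global_state:
        if not torch.is_floating_point(global_state[key]):
            continue  # leave integer buffers (e.g. BatchNorm counters) as they are
        global_state[key] = sum(
            client_models[i].state_dict()[key].float() * (client_sizes[i] / total)
            for i in range(n)
        )
    global_model.load_state_dict(global_state)
    return global_model
```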
Code for reinforcement learning is in the /RL directory. To reproduce the results, run each file in a Jupyter notebook. This part mainly consists of reinforcement-learning methods, covering the same four metrics (Model Fidelity, Test Performance, Unlearning Efficacy, and Unlearning Impact), two networks (DQN and DDPG), and two unlearning methods (decremental and poisoning).
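For orientation, a minimal DQN-style Q-network with epsilon-greedy action selection is sketched below; the network sizes and hyperparameters are placeholders and do not reflect the /RL notebooks' settings.

```python
# Minimal DQN-style Q-network and epsilon-greedy action selection. Illustrative only.
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def select_action(qnet, obs, n_actions, epsilon=0.1):
    """Epsilon-greedy policy used during DQN training (obs is a 1-D state tensor)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(qnet(obs.unsqueeze(0)).argmax(dim=1).item())
```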

Files (830.5 MB)

Duplication.zip (830.5 MB)
md5: 917ff5789cb748ffd77dba931f0ba636