Published December 22, 2025 | Version v1 | Dataset | Open

UNICODE



Contents

This repository provides the official implementation of the paper 'Enhancing Code Model Robustness Against Identifier Renaming via Unified Code Normalization'.

Description

Deep code models (DCMs) are increasingly deployed in security-critical applications. Yet their vulnerability to adversarial perturbations, such as subtle identifier renaming, poses significant risks, as these changes can induce out-of-distribution inputs and cause insecure predictions. A key challenge lies in defending against such attacks without prior knowledge of adversarial patterns: the space of possible perturbations is virtually infinite, and conventional rule-based defenses fail to generalize.

To address this challenge, we focus primarily on defending against renaming-based adversarial attacks, which have the most significant impact on DCMs' security, and propose a novel two-stage defense framework named UniCode, which proactively normalizes adversarial inputs into uniformly in-distribution representations. Please refer to overview.jpg for a detailed overview of our method's structure. The first stage strips all identifiers into placeholders, eliminating adversarial influence while preserving the code's structure and functionality; the second stage reconstructs semantically meaningful identifiers by leveraging the contextual understanding of large language models, ensuring that the full code semantics are preserved.

By fine-tuning code models on the normalized distribution, UniCode renders them inherently robust against diverse renaming attacks without requiring attack-specific adaptations. We evaluated our approach against state-of-the-art baseline methods. The experimental results show that UniCode achieves the best defense performance on 82.22% of subjects, with average improvements in defense effectiveness ranging from 9.86% to 46.1% over the baselines.
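As a minimal sketch of the first stage (not the repository's python_parser-based implementation), the snippet below uses Python's standard `tokenize` module to replace every identifier with a placeholder while keeping the code's structure intact; the `VAR_i` naming scheme is an assumption for illustration only:

```python
import io
import keyword
import tokenize

def abstract_identifiers(source: str):
    """Replace each identifier with a placeholder VAR_i, preserving structure.

    Illustrative sketch only; UniCode's abstract.py handles multiple
    languages via its own parser.
    """
    mapping = {}  # original name -> placeholder
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            if tok.string not in mapping:
                mapping[tok.string] = f"VAR_{len(mapping)}"
            out.append((tokenize.NAME, mapping[tok.string]))
        else:
            out.append((tok.type, tok.string))
    # untokenize accepts (type, string) pairs and emits equivalent source
    return tokenize.untokenize(out), mapping

stripped, names = abstract_identifiers("def area(w, h):\n    return w * h\n")
```

Any adversarially renamed variant of the same function (e.g. `w` renamed to an out-of-distribution token) maps to the same placeholder form, which is what makes the normalized distribution uniform.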


Structure

Here we briefly describe each directory in UNICODE:

```
├─ALERT (Our baseline; for each baseline we provide code for three models/datasets due to space limitations)
│ ├─codebert
│ ├─codet5
│ ├─graphcodebert
├─CDenoise (Our baseline, CodeDenoise)
├─CODA (Our baseline)
├─ITGen (Our baseline)
├─MARVEL (Our baseline)
├─CodeTAE (Our baseline)
├─code (Our Approach, UniCode)
│ ├─abstract.py (abstracting identifier names)
│ ├─replace_method_name.py (abstracting function names)
│ ├─normalization.py (conducting code instantiation using LLM)
│ ├─build_txt.py (conducting data pre-processing)
│ ├─VulnerabilityPrediction_build_jsonl.py (conducting data pre-processing)
├─python_parser (parsing code for further analysis)
```

Datasets/Models
The models and datasets used in this paper are listed below. All experimental datasets are provided in data.zip, and the corresponding model weights are stored in the current directory.

| Task | Model | Train/Val/Test | Acc. |
| :----------------------: | :--------: | :----------------: | :----: |
| Clone Detection | CodeBERT | 90,102/4,000/4,000 | 96.88% |
| | GCBERT | | 96.73% |
| | CodeT5 | | 96.40% |
| | CodeT5Plus | | 97.47% |
| Vulnerability Prediction | CodeBERT | 21,854/2,732/2,732 | 63.76% |
| | GCBERT | | 64.13% |
| | CodeT5 | | 59.99% |
| | CodeT5Plus | | 58.13% |
| Defect Prediction | CodeBERT | 27,058/–/6,764 | 84.37% |
| | GCBERT | | 84.89% |
| | CodeT5 | | 88.82% |
| | CodeT5Plus | | 88.99% |

Requirements

- python==3.7.7
- transformers==4.8.2
- pytorch==1.5.1
- pandas==1.3.0
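One possible way to install these dependencies with pip (note that PyTorch is published on PyPI as `torch`, not `pytorch`; the exact command may differ for your CUDA setup):

```shell
# hypothetical setup using the pinned versions listed above
pip install transformers==4.8.2 torch==1.5.1 pandas==1.3.0
```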

Reproducibility

Usage Instructions

To run our abstraction and instantiation pipeline, follow these steps:

Configure Paths

Modify the following variables in the code:

```
input_path = "your/input/path" # Replace with your input directory
output_path = "your/output/path" # Specify desired output location
```

API Key Setup

Replace the placeholder with your LLM provider's API key:

```
api_key = "your_api_key_here"  # e.g., DeepSeek or GPT-4o-mini
```
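For orientation, a stage-2 request to the LLM might be assembled along these lines; the template wording, the helper name `build_normalization_prompt`, and the `VAR_i` placeholder convention are assumptions for illustration, not the exact prompt used by normalization.py:

```python
def build_normalization_prompt(abstract_code: str) -> str:
    """Build an LLM prompt asking for meaningful names for each placeholder.

    Illustrative template only; not the repository's actual prompt.
    """
    instruction = (
        "The identifiers in the following code have been replaced with "
        "placeholders (VAR_0, VAR_1, ...). Based on the surrounding context, "
        "propose a semantically meaningful name for each placeholder, "
        "answering with one 'VAR_i -> name' line per placeholder.\n\n"
    )
    return instruction + abstract_code

prompt = build_normalization_prompt(
    "def VAR_0(VAR_1, VAR_2):\n    return VAR_1 * VAR_2\n"
)
# The prompt is then sent to the provider configured with `api_key`.
```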

Execution

Run the core pipeline with:

```
cd code
python abstract.py --dataset <datasetname> --model <modelname>
python normalization.py --dataset <datasetname> --model <modelname>
```

`--dataset` refers to the dataset used for evaluation

`--model` refers to the model used for evaluation

Files (8.5 GB)

data.zip

| Size | MD5 |
| :------: | :------------------------------ |
| 503.4 MB | c84f1cf3f163d11ac59cfd84819b78cc |
| 896.3 MB | 5a2f5cdf90fd6cae57dd2fb79f36b503 |
| 896.3 MB | 235cb8a4f215ec5db7a391a7fdadb9b8 |
| 503.4 MB | 4c31fa3006dc1dbe51c3504b7ec775a3 |
| 130.6 MB | 1aa462fb0513206622cf5c9024b8f99f |
| 501.0 MB | d5c42e593cebb63bd4140c9812711fd1 |
| 894.0 MB | 6147c266737298dd5fc3bc56b68822a5 |
| 894.0 MB | 764d2d5999a14ddb9ed4a87ea409d52e |
| 501.0 MB | 5484053ac28d957624ea70c239827648 |
| 5.0 kB | 1b6374823b2546bb8524f2ec5e83bb77 |
| 6.0 MB | 4ca79cf7a52e612b546a7730ecc56d96 |
| 498.6 MB | d6876823e02f5bdb040715fbef9928d2 |
| 894.0 MB | 1f07268ca5eaa530c41f041e8cf61087 |
| 894.0 MB | a1102d19929fec387a785ddaa5f5d78f |
| 498.6 MB | 02217288e7077dc9f2bf6c4e7d6cd75e |