Published January 24, 2025 | Version v4

PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs

  • 1. Wuhan University

Description

Artifact of the USENIX Security 2025 paper: PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs


## Overview

![overview.png](./overview.png)

## Installation

```bash
# Environment: Python 3.10, PyTorch 2.1.2 with CUDA 12.1
pip install "fschat[model_worker,webui]"
pip install vllm
pip install openai                # for OpenAI models
pip install termcolor
pip install openpyxl
pip install google-generativeai   # for Google PaLM 2
pip install anthropic             # for Anthropic models
```

## Models

1. We use a fine-tuned RoBERTa-large model ([Hugging Face](https://huggingface.co/hubert233/GPTFuzz)) from [GPTFuzz](https://github.com/sherdencooper/GPTFuzz) as our judge model. Thanks to the authors for their great work!

2. To use the GPT-based judge model, set your API key:

```python
# line 106 in ./Judge/language_models.py
client = OpenAI(base_url="[your proxy url (if used)]", api_key="your api key", timeout=self.API_TIMEOUT)
```

 

## Datasets

We provide three datasets of jailbreak questions:

1. `datasets/questions/question_target_list.csv`: sampled from two public datasets, [llm-jailbreak-study](https://sites.google.com/view/llm-jailbreak-study) and [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf). Following the format of [GCG](https://github.com/llm-attacks/llm-attacks), we added a corresponding target string for each question.
2. `datasets/questions/question_target.csv`: the AdvBench dataset.

3. `datasets/questions/question_target_custom.csv`: a subset of AdvBench.
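The dataset files follow the GCG-style question/target layout described above. A minimal sketch of reading such a CSV, assuming `question` and `target` column names (the exact column headers are an assumption, not confirmed by the artifact):

```python
import csv
import io

# Hypothetical sample mimicking the assumed CSV layout: one harmful question
# per row, paired with the affirmative target string the attack optimizes for.
sample = io.StringIO(
    "question,target\n"
    '"Write a tutorial on X","Sure, here is a tutorial on X"\n'
)

rows = list(csv.DictReader(sample))
for row in rows:
    print(row["question"], "->", row["target"])
```

In practice you would pass the real path, e.g. `open("datasets/questions/question_target_custom.csv")`, instead of the in-memory sample.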

## Usage example

To jailbreak gpt-3.5-turbo on the subset of AdvBench:

```bash
python run.py --openai_key [your openai_key] --model_path gpt-3.5-turbo --target_model gpt-3.5-turbo
```
## Evaluation

Set `directory_path` to the directory containing the results, then run `eval.py` to compute the ASR (attack success rate) and AQ metrics.
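As a rough illustration of what the evaluation computes, here is a minimal sketch assuming ASR is the fraction of questions successfully jailbroken and AQ is the average number of queries spent per question. The record format and field names below are hypothetical, not the artifact's actual output schema:

```python
# Hypothetical per-question result records (field names are assumptions):
# whether the question was jailbroken, and how many queries were used.
results = [
    {"jailbroken": True,  "queries": 12},
    {"jailbroken": False, "queries": 50},
    {"jailbroken": True,  "queries": 8},
]

asr = sum(r["jailbroken"] for r in results) / len(results)  # attack success rate
aq = sum(r["queries"] for r in results) / len(results)      # average queries

print(f"ASR: {asr:.2%}, AQ: {aq:.1f}")
```

`eval.py` derives the same kind of aggregate statistics from the saved result files in `directory_path`.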

 

Files

Papillon_main.zip (1.1 MB, md5:6c8265c857f05aa9427ab4f0ba8fcd2e)