Published January 24, 2025 | Version v4
PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs
Description
Artifact of the USENIX 2025 paper: PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs
## Overview

## Installation
```bash
# Tested with Python 3.10 and PyTorch 2.1.2+cu12.1
# requirements
pip install "fschat[model_worker,webui]"
pip install vllm
pip install openai # for openai LLM
pip install termcolor
pip install openpyxl
pip install google-generativeai # for google PALM-2
pip install anthropic # for anthropic
```
## Models
1. We use a fine-tuned RoBERTa-large model ([Hugging Face](https://huggingface.co/hubert233/GPTFuzz)) from [GPTFuzz](https://github.com/sherdencooper/GPTFuzz) as our judge model. Thanks for their great work!
2. For the GPT judge model, set your API key:
```python
# line 106 in ./Judge/language_models.py
client = OpenAI(base_url="[your proxy url (if used)]", api_key="[your api key]", timeout=self.API_TIMEOUT)
```
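Rather than hardcoding the key in `language_models.py`, one option is to read it from the environment. A minimal sketch (the helper name and the use of `OPENAI_API_KEY`/`OPENAI_BASE_URL` environment variables are our own convention, not part of the artifact):

```python
import os

def judge_client_kwargs(timeout=60):
    """Collect OpenAI client settings from the environment.

    Hypothetical helper: the returned dict would be splatted into
    OpenAI(**judge_client_kwargs(...)) in language_models.py.
    """
    kwargs = {
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "timeout": timeout,
    }
    base_url = os.environ.get("OPENAI_BASE_URL")  # proxy URL, if any
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs
```

This keeps secrets out of the source tree and lets you switch proxies without editing code.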
## Datasets
We provide three datasets of questions to jailbreak:
1. `datasets/questions/question_target_list.csv`: sampled from two public datasets, [llm-jailbreak-study](https://sites.google.com/view/llm-jailbreak-study) and [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf). Following the format of [GCG](https://github.com/llm-attacks/llm-attacks), we added a corresponding target for each question.
2. `datasets/questions/question_target.csv`: the full AdvBench dataset.
3. `datasets/questions/question_target_custom.csv`: a subset of AdvBench.
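In the GCG-style format, each row pairs a harmful question with the target prefix the attack tries to elicit. A quick way to inspect such a file with the standard library (the `question`/`target` column names here are an assumption — check the real CSV header):

```python
import csv
import io

# Toy two-row stand-in for datasets/questions/question_target.csv;
# the column names are assumed, not taken from the artifact.
sample = io.StringIO(
    "question,target\n"
    '"Example harmful question?","Sure, here is an example answer"\n'
)

rows = list(csv.DictReader(sample))
for row in rows:
    print(row["question"], "->", row["target"])
```

Replace the in-memory buffer with `open("datasets/questions/question_target.csv")` to browse the real dataset.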
## Example usage
To jailbreak gpt-3.5-turbo on the subset of AdvBench:
```bash
python run.py --openai_key [your openai_key] --model_path gpt-3.5-turbo --target_model gpt-3.5-turbo
```
## Evaluation
Set `directory_path` to the directory containing the results, then run `eval.py` to compute the attack success rate (ASR) and average number of queries (AQ).
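The two metrics are straightforward aggregates over per-question results. A minimal sketch of how they are typically computed (the record fields `success` and `queries` are illustrative, not the artifact's actual result schema):

```python
# Hypothetical per-question result records: whether the jailbreak
# succeeded and how many queries to the target model it took.
results = [
    {"success": True, "queries": 12},
    {"success": False, "queries": 50},
    {"success": True, "queries": 8},
]

# ASR: fraction of questions successfully jailbroken.
asr = sum(r["success"] for r in results) / len(results)
# AQ: mean number of queries spent per question.
aq = sum(r["queries"] for r in results) / len(results)

print(f"ASR={asr:.2f}  AQ={aq:.1f}")
```

Lower AQ at a given ASR means a more query-efficient attack.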
## Files
`Papillon_main.zip` (1.1 MB), md5: `6c8265c857f05aa9427ab4f0ba8fcd2e`