Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities
This is the repository for the paper "Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities." It provides the UnsafeConcepts dataset, the fine-tuned LLaVA checkpoints, and the implementation code.
Disclaimer. This repo contains examples of unsafe or hateful images. Reader discretion is recommended. This repo is intended for research purposes only. Any misuse is strictly prohibited.
Overview
This repo includes:
- Access to the UnsafeConcepts dataset, a manually annotated image dataset containing 1.5K examples;
- Access to the VLM-generated responses and the response classifiers;
- Access to the checkpoints fine-tuned with SFT, DPO, and PPO, along with their training scripts;
- Scripts for reproducing the key results in the paper.
Environment Setup
cd SaferVLM
bash setup.sh
conda activate llava
UnsafeConcepts Dataset
Because the dataset contains unsafe images, we have taken great care to share it responsibly. Access is restricted, and the dataset is available upon request for research purposes only.
First request access, then download the dataset from Hugging Face (yiting/UnsafeConcepts):
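# Gated dataset: log in first (e.g., `huggingface-cli login`) with an account that has been granted access.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yiting/UnsafeConcepts",
    repo_type="dataset",
    local_dir="data/images",
)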
Column | Description |
---|---|
image_filename | Path and filename of the saved image |
category | Unsafe category the image belongs to, e.g., "Hate" |
unsafe concept | Annotated unsafe concept, e.g., "Confederate Flag" |
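As a rough illustration of how these columns might be used, the sketch below counts images per category. It assumes the metadata can be read as CSV with exactly the column names listed above. The actual file name and format in the dataset are not stated here, so the inline sample rows, the placeholder filenames, and the `count_by_category` helper are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows for illustration only; the real metadata file
# and its format may differ from this CSV layout.
SAMPLE_METADATA = """image_filename,category,unsafe concept
data/images/example_0001.png,Hate,Confederate Flag
data/images/example_0002.png,Hate,Confederate Flag
"""


def count_by_category(csv_text):
    """Count rows per value of the `category` column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["category"] for row in reader)


counts = count_by_category(SAMPLE_METADATA)
for category, n in counts.items():
    print(f"{category}: {n}")
```

To apply this to real metadata, read the file contents into a string and pass it to `count_by_category`, adjusting the column name if the dataset uses a different one.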
Reproduce Measurement Results
Download VLM-generated responses here.
Download the response classifiers:
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yiting/perception_classifier",
    repo_type="model",
    local_dir="checkpoints/perception_classifier",
)
local_dir = snapshot_download(
    repo_id="yiting/alignment_classifier",
    repo_type="model",
    local_dir="checkpoints/alignment_classifier",
)
python measure.py --measure_mode perception --response_dir data/VLM_responses
python measure.py --measure_mode alignment --response_dir data/VLM_responses
python measure.py --measure_mode alignment_text_only --response_dir data/VLM_responses
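Before running the measurement commands above, it can help to confirm that the downloaded responses and classifiers are where the commands expect them. The sketch below is a minimal pre-flight check, not part of the repository. The paths are taken from the steps above, while the `missing_paths` helper is a hypothetical name introduced for illustration.

```python
from pathlib import Path

# Paths taken from the download and measurement steps above.
REQUIRED_PATHS = [
    "data/VLM_responses",
    "checkpoints/perception_classifier",
    "checkpoints/alignment_classifier",
]


def missing_paths(root=".", required=REQUIRED_PATHS):
    """Return the required paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]


missing = missing_paths()
if missing:
    print("Missing before running measure.py:", ", ".join(missing))
else:
    print("All required paths found.")
```

Run it from the repository root so that the relative paths resolve the same way they do for `measure.py`.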
RLHF
Download the fine-tuned checkpoints and save them under checkpoints/rlhf/:
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yiting/llava-lora",
    repo_type="model",
    local_dir="checkpoints/rlhf",
)
To evaluate their performance on different datasets, e.g., UnsafeConcepts_TEST, run
python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path checkpoints/rlhf/sft --save_dir results/sft
python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path checkpoints/rlhf/dpo --save_dir results/dpo
python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path checkpoints/rlhf/ppo --save_dir results/ppo
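The three evaluation commands differ only in the fine-tuning method, so they can be generated from a single template. The sketch below builds the same command lines using the flags and paths shown above. The `build_eval_command` helper is a hypothetical name, and the sketch only prints the commands rather than running them.

```python
import shlex

# Fine-tuning methods with released checkpoints, as listed above.
METHODS = ["sft", "dpo", "ppo"]


def build_eval_command(method, dataset_name="UnsafeConcepts_TEST"):
    """Build the eval_rlhf.py argument list for one fine-tuning method."""
    return [
        "python", "eval_rlhf.py",
        "--dataset_name", dataset_name,
        "--lora_path", f"checkpoints/rlhf/{method}",
        "--save_dir", f"results/{method}",
    ]


for method in METHODS:
    # shlex.join renders the argument list as a copy-pasteable shell command.
    print(shlex.join(build_eval_command(method)))
```

To execute the commands instead of printing them, each list can be passed to `subprocess.run(cmd, check=True)` from the repository root.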
To train these models, run
cd RLHF
bash scripts/train_sft.sh
bash scripts/train_dpo.sh
bash scripts/train_ppo.sh