Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities

Qu, Yiting; Backes, Michael; Zhang, Yang

doi:10.5281/zenodo.15613562

Published June 7, 2025 | Version v1

1. CISPA Helmholtz Center for Information Security
2. Helmholtz Center for Information Security

This is the repository for the paper "Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities." It releases the UnsafeConcepts dataset, fine-tuned LLaVA checkpoints, and the implementation code.

Disclaimer. This repo contains examples of unsafe or hateful images. Reader discretion is recommended. This repo is intended for research purposes only. Any misuse is strictly prohibited.

Overview

This repo includes:

Access to the UnsafeConcepts dataset, a manually annotated image dataset containing 1.5K examples;
Access to the VLM-generated responses and the response classifiers;
Access to the checkpoints fine-tuned with SFT, DPO, and PPO, and their training scripts;
Scripts for reproducing the key results in the paper.

Environment Setup

cd SaferVLM

bash setup.sh

conda activate llava

UnsafeConcepts Dataset

Because the dataset contains unsafe images, we have taken great care to share it responsibly. Access is restricted and granted upon request for research purposes only.

First request access, then download the dataset from Hugging Face (yiting/UnsafeConcepts):

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yiting/UnsafeConcepts",
    repo_type="dataset",
    local_dir="data/images",
)

Column	Description
image_filename	Saved path and file name of the image
category	Unsafe category the image belongs to, e.g., "Hate"
unsafe concept	annotated unsafe concept, e.g., "Confederate Flag"
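
As an illustration of how these columns can be used, the sketch below groups annotated concepts by category. The on-disk format of the metadata is not described here, so the tab-separated layout, the file names, and the second sample row are assumptions made purely for the example; only the column names and the "Hate" / "Confederate Flag" values come from the table above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows: the file names and the second row are placeholders,
# not real dataset entries. Column names follow the table above.
SAMPLE = (
    "image_filename\tcategory\tunsafe concept\n"
    "images/example_0.png\tHate\tConfederate Flag\n"
    "images/example_1.png\tPlaceholderCategory\tplaceholder concept\n"
    "images/example_2.png\tHate\tConfederate Flag\n"
)


def concepts_by_category(rows):
    """Map each category to the sorted list of distinct concepts annotated in it."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["category"]].add(row["unsafe concept"])
    return {category: sorted(concepts) for category, concepts in grouped.items()}


def load_sample():
    """Parse the hypothetical sample; swap in the real metadata file as needed."""
    return list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
```

Calling `concepts_by_category(load_sample())` returns one entry per category; with the real metadata, replace `load_sample` with a reader for whatever format the dataset actually ships.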

Reproduce Measurement Results

Download VLM-generated responses here.

Download the response classifiers:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yiting/perception_classifier",
    repo_type="model",
    local_dir="checkpoints/perception_classifier",
)

local_dir = snapshot_download(
    repo_id="yiting/alignment_classifier",
    repo_type="model",
    local_dir="checkpoints/alignment_classifier",
)

python measure.py --measure_mode perception --response_dir data/VLM_responses

python measure.py --measure_mode alignment --response_dir data/VLM_responses

python measure.py --measure_mode alignment_text_only --response_dir data/VLM_responses
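
The three commands above differ only in the value of `--measure_mode`, so they can be generated in a loop. The following is a minimal sketch, not part of the repository: the helper names `measure_commands` and `run_all` are hypothetical, while the script name, flags, and mode names are copied from the commands above.

```python
import subprocess
import sys

# Mode names copied from the commands above.
MEASURE_MODES = ("perception", "alignment", "alignment_text_only")


def measure_commands(response_dir="data/VLM_responses"):
    """Build one argv list per measurement mode (hypothetical helper)."""
    return [
        [sys.executable, "measure.py",
         "--measure_mode", mode,
         "--response_dir", response_dir]
        for mode in MEASURE_MODES
    ]


def run_all(response_dir="data/VLM_responses"):
    """Run every mode in order, stopping at the first failing command."""
    for cmd in measure_commands(response_dir):
        subprocess.run(cmd, check=True)
```

Call `run_all()` from the directory that contains `measure.py`, with the `llava` environment activated.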

RLHF

Download the fine-tuned checkpoints and save them under checkpoints/rlhf/*:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yiting/llava-lora",
    repo_type="model",
    local_dir="checkpoints/rlhf",
)

Evaluate their performance on different datasets, e.g., UnsafeConcepts_TEST, by running:

python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path checkpoints/rlhf/sft --save_dir results/sft

python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path checkpoints/rlhf/dpo --save_dir results/dpo

python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path checkpoints/rlhf/ppo --save_dir results/ppo
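
Before launching the evaluation, it may help to confirm that the download produced the expected layout. The sketch below assumes one subdirectory per method (`sft`, `dpo`, `ppo`) under `checkpoints/rlhf`, which is inferred from the `--lora_path` arguments above rather than documented; `missing_checkpoints` is a hypothetical helper.

```python
from pathlib import Path

# Subdirectory names inferred from the --lora_path arguments above (an assumption).
METHODS = ("sft", "dpo", "ppo")


def missing_checkpoints(root="checkpoints/rlhf"):
    """Return the method names whose checkpoint directory is absent under root."""
    root = Path(root)
    return [method for method in METHODS if not (root / method).is_dir()]
```

An empty list means all three LoRA checkpoint directories are present; any name returned indicates a checkpoint that still needs to be downloaded.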

To train these models, run:

cd RLHF

bash scripts/train_sft.sh

bash scripts/train_dpo.sh

bash scripts/train_ppo.sh

Files

Total size: 70.6 MB

Name	MD5	Size
common_utils.py	e7f266c286897b56ee26df7555e59c9f	15.8 kB
data_utils_dpo.py	d8d75992daf8346da06c5aec4e60ae10	13.9 kB
data_utils_ppo.py	d4edb822a82d76c43729926911fbe65a	8.9 kB
data_utils_sft.py	c7d8cd4cc977a7163c456d67ec423517	10.1 kB
distributed_utils.py	ae4b9d1c12d8b056a987ca967c43d45f	2.6 kB
dpo_train.json	f98b7b2b3bf67345399e42bef98d33a9	759.2 kB
dpo_trainer.py	79c5f452b97db6ed81f77695e1e71ffd	28.8 kB
eval_rlhf.py	12a8ca6552ab4c5953455215e1ec893e	9.5 kB
finetune_lora_dpo.py	2135865cc03b52486710460865a626af	30.9 kB
finetune_lora_ppo.py	bb8e2b83010e886008e9d0a6bc6c265a	18.9 kB
finetune_lora_sft.py	067305e2d73c81bd229e786c62610ce7	17.3 kB
LLaVABench.tsv	d382a093f749a697820d3dadd61c8428	19.7 MB
lora_model.py	47905efd54b8133d32e8281299f8cd27	3.9 kB
lora_utils.py	1d64e923afa6346abb6b10cf751b5d61	4.2 kB
measure.py	fda32a97c5e2bb4845ee3f7cbcacb32d	10.7 kB
MME.tsv	b36b43c3f09801f5d368627fb92187c3	48.1 MB
ppo_train.json	d15d159a437df66b40f578243ae385d1	845.0 kB
ppo_trainer.py	a55d0922216a3a68bc4e3eab38f6d7f7	39.9 kB
pyproject.toml	178ac1aeba7d2c4192551a02cfcde9a1	1.1 kB
reward_model.py	c8350cc0845396ee125319400409034d	13.9 kB
rl_models.py	a1f0498d5f19b411478c9c7f8d3d422a	12.4 kB
rl_trainer.py	c41beb7052e9202e99574734b9a735b6	20.5 kB
setup.sh	cd1d380bbeec7c248bd142e9ea449c30	362 Bytes
sft_train.json	3cf3a2a2b32169276ac9a1a1d44410e8	832.8 kB
test_data.json	d4c820d2f21175d3e4dc6d0db0a0462c	74.7 kB
train_dpo.sh	a60819f3f324be422d94e8a910f0355a	2.1 kB
train_ppo.sh	4dcfbe45726a2023cc3171e5bdd961ad	3.4 kB
train_sft.sh	a3de8fee826bd3b41d425e34ca15690b	1.9 kB
trainer_utils.py	02e535de7bef41edb3e0acfeca99f5c9	2.9 kB
unsafe_datasets.py	0fd0acd9ff568bb85df7197c68daf976	12.3 kB
zero2.json	6f60adf5d046ba7c099111ad9507d4dd	358 Bytes
zero3.json	68d704f6c3b894f46a4e7f4bba8844d5	801 Bytes
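
The MD5 checksums listed for each file can be used to verify a download. A minimal sketch using only the standard library follows; `md5_of` and `matches` are hypothetical helper names, not part of the repository.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected):
    """Compare a file against a listed checksum, with or without an 'md5:' prefix."""
    return md5_of(path) == expected.split(":")[-1].strip().lower()
```

For example, `matches("setup.sh", "md5:cd1d380bbeec7c248bd142e9ea449c30")` should return `True` for an intact copy of `setup.sh`, using the checksum from the list above.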
