Published June 7, 2025 | Version v1
Software Open

Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities

  • 1. CISPA Helmholtz Center for Information Security
  • 2. Helmholtz Center for Information Security

Description

This is the repository for the paper "Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities." It provides the UnsafeConcepts dataset, the fine-tuned LLaVA checkpoints, and the implementation code.

Disclaimer. This repo contains examples of unsafe or hateful images. Reader discretion is recommended. This repo is intended for research purposes only. Any misuse is strictly prohibited.

Overview

This repo includes:

  • Access to the UnsafeConcepts dataset, a manually annotated image dataset containing 1.5K examples
  • Access to the VLM-generated responses and the response classifiers
  • Access to the checkpoints fine-tuned using SFT, DPO, and PPO, along with their training scripts
  • Scripts for reproducing the key results in the paper

Environment Setup

cd SaferVLM

bash setup.sh

conda activate llava

UnsafeConcepts Dataset

We have taken great care to share our dataset responsibly due to the presence of unsafe images. The dataset therefore has restricted access and is available upon request for research purposes only.

First request access, then download the dataset from Hugging Face ("yiting/UnsafeConcepts"):

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yiting/UnsafeConcepts",
    repo_type="dataset",
    local_dir="data/images")

 

Column           Description
image_filename   Path and filename of the saved image
category         Unsafe category the image belongs to, e.g., "Hate"
unsafe concept   Annotated unsafe concept, e.g., "Confederate Flag"
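As a rough illustration of working with the three columns above, the sketch below counts annotated images per category. The metadata file name and format are not stated in this README, so the CSV layout, the sample rows, and the image paths here are hypothetical placeholders rather than actual dataset contents.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows using the three documented columns.
# The real metadata file name and format are assumptions, not from the repo.
SAMPLE = """image_filename,category,unsafe concept
images/0001.png,Hate,Confederate Flag
images/0002.png,Hate,Confederate Flag
"""


def count_by_category(text):
    """Count annotated images per unsafe category."""
    reader = csv.DictReader(io.StringIO(text))
    return Counter(row["category"] for row in reader)


counts = count_by_category(SAMPLE)
print(counts)
```

Swapping `row["category"]` for `row["unsafe concept"]` would give per-concept counts instead, assuming the column is named exactly as in the table.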

 

Reproduce Measurement Results

Download VLM-generated responses here.

Download the response classifiers:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yiting/perception_classifier",
    repo_type="model",
    local_dir="checkpoints/perception_classifier")

local_dir = snapshot_download(
    repo_id="yiting/alignment_classifier",
    repo_type="model",
    local_dir="checkpoints/alignment_classifier")

python measure.py --measure_mode perception --response_dir data/VLM_responses

python measure.py --measure_mode alignment --response_dir data/VLM_responses

python measure.py --measure_mode alignment_text_only --response_dir data/VLM_responses
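The three measurement runs above could also be driven from one small script. This is only a convenience sketch: it uses the `--measure_mode` and `--response_dir` flags exactly as shown above, while the helper names `build_measure_commands` and `run_all` are made up for illustration and are not part of the repo.

```python
import subprocess
import sys

# The three modes listed in this README.
MODES = ["perception", "alignment", "alignment_text_only"]


def build_measure_commands(response_dir="data/VLM_responses"):
    """Build one measure.py command (as an argument list) per mode."""
    return [
        [sys.executable, "measure.py",
         "--measure_mode", mode,
         "--response_dir", response_dir]
        for mode in MODES
    ]


def run_all(response_dir="data/VLM_responses", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_measure_commands(response_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing run.
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```

Call `run_all(dry_run=False)` from the repository root once the responses and classifiers have been downloaded.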

 

RLHF

Download the fine-tuned checkpoints and save them to checkpoints/rlhf/*

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yiting/llava-lora",
    repo_type="model",
    local_dir="checkpoints/rlhf")

Evaluate their performance on different datasets, e.g., UnsafeConcepts_TEST, by running:

python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path checkpoints/rlhf/sft --save_dir results/sft

python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path checkpoints/rlhf/dpo --save_dir results/dpo

python eval_rlhf.py --dataset_name UnsafeConcepts_TEST --lora_path checkpoints/rlhf/ppo --save_dir results/ppo
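Before launching the evaluations, it can help to confirm that each checkpoint directory actually exists. The sketch below assumes the `checkpoints/rlhf/{sft,dpo,ppo}` layout implied by the commands above; the helper `eval_commands` is hypothetical and only prints the command lines for the checkpoints it finds.

```python
from pathlib import Path

# Fine-tuning methods named in this README.
METHODS = ["sft", "dpo", "ppo"]


def eval_commands(ckpt_root="checkpoints/rlhf", dataset="UnsafeConcepts_TEST"):
    """Return eval_rlhf.py command lines for checkpoints present on disk."""
    cmds = []
    for method in METHODS:
        lora_path = Path(ckpt_root) / method
        if not lora_path.is_dir():
            # Skip methods whose checkpoint has not been downloaded yet.
            print(f"skipping {method}: {lora_path} not found")
            continue
        cmds.append(
            f"python eval_rlhf.py --dataset_name {dataset} "
            f"--lora_path {lora_path.as_posix()} --save_dir results/{method}"
        )
    return cmds


for line in eval_commands():
    print(line)
```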

 

To train these models, run

cd RLHF

bash scripts/train_sft.sh

bash scripts/train_dpo.sh

bash scripts/train_ppo.sh

 

 
