HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
This is the official repository for the paper "HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns."
In this paper, we propose HateBench, a framework designed to benchmark hate speech detectors on LLM-generated content.
Disclaimer. This repo contains examples of hateful and abusive language. Reader discretion is recommended. This repo is intended for research purposes only. Any misuse is strictly prohibited.
Overview
Our artifact repository includes:
- HateBench, the framework designed to benchmark hate speech detectors on LLM-generated content.
- HateBenchSet, the manually annotated dataset, comprising 7,838 LLM-generated samples across 34 identity groups.
- Code for reproducing the LLM-driven hate campaigns, including both the adversarial hate campaign and the stealthy hate campaign.
- Scripts to generate the key result tables and figures from the paper, including:
  - Table 3: Performance on LLM-generated samples.
  - Table 4: F1-score on LLM-generated and human-written samples.
  - Table 6: Performance of the adversarial hate campaign.
  - Table 8: Performance of model stealing attacks.
  - Table 9: Performance of the stealthy hate campaign with black-box attacks.
  - Table 10: Performance of the stealthy hate campaign with white-box gradient optimization.
Environment Requirements
All experiments were tested in a conda environment on Ubuntu 20.04.6 LTS with Python 3.9.0. Tables 3 and 4 can be reproduced directly on a local PC without a GPU. For the other experiments, we recommend an NVIDIA GeForce RTX 3090 or a more powerful GPU, such as an RTX 4090 or A100. The results presented in the paper were obtained with an NVIDIA GeForce RTX 3090.
Environment Setup
conda create -n hatebench python=3.9.0
conda activate hatebench
pip install -r requirements.txt
Then launch Python and download the required NLTK data:
python
import nltk
nltk.download('averaged_perceptron_tagger_eng')
exit()
HateBench
HateBenchSet
HateBenchSet is provided in measurement/data/HateBenchSet.csv.
Column | Description |
---|---|
model | Model used to generate responses. |
status | Status of the model, i.e., original or jailbreak. |
status_prompt | Prompt used to set the model. |
main_target | The category of identity groups, e.g., race, religion, etc. |
sub_target | The identity group. |
target_name | The complete name of the identity group. |
pid | Prompt ID. |
prompt | The prompt. |
text | The sample generated by the model. |
hate_label | 1 denotes Hate, 0 denotes Non-Hate. |
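For a quick sanity check of the dataset, the snippet below loads the CSV with pandas and tallies samples by model, status, and hate label. This is an illustrative sketch rather than part of the artifact; it assumes pandas is installed and that you run it from the repository root.

```python
# Sketch: inspect HateBenchSet.csv (assumes pandas; run from the repo root).
import pandas as pd

df = pd.read_csv("measurement/data/HateBenchSet.csv")

print(df.shape)                     # the paper reports 7,838 samples
print(df["target_name"].nunique())  # identity groups (the paper reports 34)

# Hate vs. non-hate counts, broken down by model and status (original/jailbreak).
print(df.groupby(["model", "status"])["hate_label"].value_counts())
```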
We also provide measurement/data/HateBenchSet_labeled.csv, which is HateBenchSet augmented with the predictions of the six detectors evaluated in our paper. For each detector, the predictions are recorded in the following columns:
- {detector}: the complete record returned by the detector.
- {detector}_score: the hate score of the sample.
- {detector}_flagged: whether the sample is predicted as hate or not.
Reproduce Paper Results
Table 3:
python measurement/calculate_detector_performance.py
Table 4:
python measurement/calculate_detector_LLM_performance.py
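As a rough cross-check, the same numbers can be approximated directly from HateBenchSet_labeled.csv using the hate_label and {detector}_flagged columns. The sketch below assumes pandas and scikit-learn are available; the detector name is a hypothetical placeholder, so check the CSV header for the exact column prefixes used in the artifact.

```python
# Sketch: per-detector accuracy/F1 from HateBenchSet_labeled.csv.
# "Perspective" is a hypothetical column prefix -- inspect df.columns for the
# names actually used in the artifact.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("measurement/data/HateBenchSet_labeled.csv")

detector = "Perspective"                     # placeholder prefix
y_true = df["hate_label"]
y_pred = df[f"{detector}_flagged"].astype(int)

print(f"{detector}: accuracy={accuracy_score(y_true, y_pred):.3f}, "
      f"F1={f1_score(y_true, y_pred):.3f}")
```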
LLM-Driven Hate Campaign
Adversarial Hate Campaign (Table 6)
In our experiments, we consider three target models: Perspective, Moderation, and TweetHate. The first two are commercial models, while the last one is an open-source model. Considering the potential ethical risks and the need for API keys when attacking commercial models, we provide a script to reproduce the TweetHate results presented in Table 6.
cd hate_campaign
bash scripts/run_adversarial_hate_campaign.sh
Results are automatically stored in the ./logs/ directory. The naming convention for the log files is adv_hate_campaign_{target_model}_{attack}.log. For example, to check the results of the TextFooler attack on the TweetHate model, refer to the log file adv_hate_campaign_TweetHate_TextFooler.log; the results are printed at the end of the log.
----------------------------------------
ARGUMENT VALUE
attack_strategy adv_hate_campaign
target_model TweetHate
attack_method textfooler
dataset HateBench
num_examples 120
...
----------------------------------------
[Succeeded / Failed / Skipped / Total] 117 / 3 / 0 / 120: 100%|██████████████████████████████████████████████████████████████████| 120/120 [03:15<00:00, 1.63s/it]
...
====== Attack Summary ======
textfooler
ASR 0.975
WMR 0.115
USE 0.903
Meteor 0.916
Fluency 89.657
# Queries 207.920
Time 1.142
============================
END
Note:
- The Time metric represents the average query time (in seconds) and is significantly influenced by GPU performance. The Time results in the paper were obtained with an NVIDIA GeForce RTX 3090.
- The Paraphrase attack relies on an LLM to generate adversarial hate speech, so its metric results are less stable than those of the other adversarial attacks.
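The progress bar and summary above follow TextAttack's output format. For readers unfamiliar with that tooling, here is a minimal, hedged sketch of running a TextFooler attack against a Hugging Face hate speech classifier with TextAttack. The checkpoint name and the two placeholder samples are assumptions, and the repository's scripts add their own model wrappers and the extra metrics (WMR, USE, Meteor, Fluency), so treat this purely as an illustration of the black-box setup rather than the artifact's exact code.

```python
# Illustrative TextFooler attack via TextAttack (not the repository's exact code).
import transformers
from textattack import AttackArgs, Attacker
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import Dataset
from textattack.models.wrappers import HuggingFaceModelWrapper

# Placeholder open-source hate speech classifier; the artifact wraps its own
# target models (e.g., TweetHate) instead.
model_name = "cardiffnlp/twitter-roberta-base-hate-latest"
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
wrapper = HuggingFaceModelWrapper(model, tokenizer)

# Placeholder (text, label) pairs; the artifact attacks hate samples drawn
# from HateBenchSet.
data = Dataset([("placeholder hateful sample one", 1),
                ("placeholder hateful sample two", 1)])

attack = TextFoolerJin2019.build(wrapper)
attacker = Attacker(attack, data, AttackArgs(num_examples=-1, log_to_csv="attack_log.csv"))
attacker.attack_dataset()
```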
Stealthy Hate Campaign
Steal the Target Model (Table 8)
nohup python model_stealing.py --target_model TweetHate --surrogate_model roberta > ./logs/TweetHate_roberta.log &
nohup python model_stealing.py --target_model TweetHate --surrogate_model bert > ./logs/TweetHate_bert.log &
Check the end of each log file to view the results of the model stealing attack for that target model and surrogate model.
-------------------- EPOCH 9 --------------------
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 262/262 [01:42<00:00, 2.57it/s]
Training Loss: 0.003955764816971744
Training Agreement: 99.25039872408293
Training Accuracy: 85.56618819776715
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 196/196 [00:24<00:00, 8.10it/s]
Validation Loss: 0.043098581649116695
Validation Agreement: 92.53826530612245
Validation Ground Truth Accuracy: 85.0765306122449
All files saved
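Conceptually, model stealing queries the target detector for its decisions on a corpus of texts and then fine-tunes a surrogate classifier (RoBERTa or BERT) on those pseudo-labels; the "Agreement" numbers in the log are the fraction of samples on which the surrogate matches the target. Below is a minimal sketch of that idea using the Hugging Face Trainer, with a stubbed-out query_target function and placeholder texts standing in for the real target API and query corpus; it is not the artifact's model_stealing.py.

```python
# Conceptual sketch of model stealing (not the artifact's model_stealing.py).
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def query_target(text: str) -> int:
    # Stand-in for one call to the real target detector (API or local model);
    # should return the target's binary hate (1) / non-hate (0) decision.
    return 0

texts = ["a benign placeholder sentence"] * 64   # placeholder query corpus

surrogate_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(surrogate_name)
model = AutoModelForSequenceClassification.from_pretrained(surrogate_name, num_labels=2)

# 1) Pseudo-label the corpus with the target's decisions.
labels = [query_target(t) for t in texts]

# 2) Fine-tune the surrogate on those decisions.
ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length"),
            batched=True)

def agreement(eval_pred):
    # Fraction of samples where the surrogate's prediction matches the target's.
    logits, label_ids = eval_pred
    return {"agreement": float((np.argmax(logits, axis=-1) == label_ids).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="surrogate", num_train_epochs=3, report_to="none"),
    train_dataset=ds,
    eval_dataset=ds,            # a real run would hold out a validation split
    compute_metrics=agreement,
)
trainer.train()
print(trainer.evaluate())
```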
Attack (Tables 9 and 10)
After generating the surrogate model, run the stealthy hate campaign.
bash scripts/run_stealthy_hate_campaign.sh
Results are automatically stored in the ./logs/ directory. The naming convention for the log files is stealthy_hate_campaign_{target_model}_{surrogate_model_arch}_{attack}.log. Here, we evaluate two attacks: textfooler refers to the black-box attack and textfooler_gradient to the white-box attack. The results are printed at the end of the log.
====== Attack Summary ======
textfooler_gradient
ASR (S) 0.950
ASR (T) 0.588
WMR 0.166
USE 0.843
Meteor 0.882
Fluency 91.502
# Queries (S) 228.980
# Queries (T) 1.000
Time (S) 1.375
Time (T) 0.376
============================
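The "# Queries (T) 1.000" row captures the key property of the stealthy campaign: adversarial texts are crafted entirely against the local surrogate, and the target is queried only once per sample to check whether the attack transfers. A minimal sketch of that transfer check follows; query_target and adversarial_texts are placeholders, and the ASR (T) shown here is simply the fraction of crafted samples the target fails to flag (the paper's exact definition may condition differently on the original predictions).

```python
# Sketch of the transfer check: each adversarial text crafted on the surrogate
# costs exactly one query to the target detector.

def query_target(text: str) -> int:
    # Stand-in for a single call to the real target (1 = flagged as hate).
    return 0

adversarial_texts = ["adversarial candidate one", "adversarial candidate two"]

evasions = sum(1 for t in adversarial_texts if query_target(t) == 0)
print(f"ASR (T): {evasions / len(adversarial_texts):.3f}")
```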
Evaluate Perspective and Moderation
If you want to evaluate the Perspective or Moderation API, first follow the platform's instructions to obtain an API key and fill it in hate_campaign/api_key.py.
- Perspective API: https://developers.perspectiveapi.com/s/docs-get-started?language=en_US
- OpenAI Moderation API: https://platform.openai.com/docs/guides/moderation/overview
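For reference, querying the two commercial detectors typically looks like the snippets below, using the current google-api-python-client and openai libraries. The key variables are placeholders for the values you put in hate_campaign/api_key.py, and the repository's own wrapper may structure these calls differently.

```python
# Illustrative API calls (not necessarily how the repo's wrapper is written).
from googleapiclient import discovery
from openai import OpenAI

PERSPECTIVE_API_KEY = "YOUR_PERSPECTIVE_KEY"   # placeholder, see api_key.py
OPENAI_API_KEY = "YOUR_OPENAI_KEY"             # placeholder, see api_key.py

text = "sample text to score"

# Perspective API: request a toxicity score in [0, 1].
perspective = discovery.build(
    "commentanalyzer", "v1alpha1",
    developerKey=PERSPECTIVE_API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)
resp = perspective.comments().analyze(body={
    "comment": {"text": text},
    "requestedAttributes": {"TOXICITY": {}},
}).execute()
toxicity = resp["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# OpenAI Moderation API: returns per-category scores and a flagged boolean.
moderation = OpenAI(api_key=OPENAI_API_KEY).moderations.create(input=text)
flagged = moderation.results[0].flagged

print(toxicity, flagged)
```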
Adversarial Hate Campaign
bash scripts/run_adversarial_hate_campaign_Perspective.sh
bash scripts/run_adversarial_hate_campaign_Moderation.sh
Stealthy Hate Campaign
Model stealing:
nohup python model_stealing.py --target_model Perspective --surrogate_model roberta > ./logs/Perspective_roberta.log &
nohup python model_stealing.py --target_model Perspective --surrogate_model bert > ./logs/Perspective_bert.log &
nohup python model_stealing.py --target_model Moderation --surrogate_model roberta > ./logs/Moderation_roberta.log &
nohup python model_stealing.py --target_model Moderation --surrogate_model bert > ./logs/Moderation_bert.log &
After generating the corresponding surrogate models, run the following scripts to conduct the stealthy hate campaign.
bash scripts/run_stealthy_hate_campaign_Perspective.sh
bash scripts/run_stealthy_hate_campaign_Moderation.sh
Ethics & Disclosure
Our work relies on LLMs to generate samples, and all manual annotations were performed by the authors of this study. Therefore, our study is not considered human subjects research by our Institutional Review Board (IRB). By doing the annotations ourselves, we also ensure that no human subjects were exposed to harmful information during our study. Since our work involves the assessment of LLM-driven hate campaigns, it inevitably discloses how attackers can evade a hate speech detector. We have taken great care to share our findings responsibly. We disclosed the paper and the labeled dataset to OpenAI, Google Jigsaw, and the developers of the open-source detectors. In our disclosure letter, we explicitly highlighted the high attack success rates of the LLM-driven hate campaigns. We have received acknowledgments from OpenAI and Google Jigsaw.
This repo is intended for research purposes only. Any misuse is strictly prohibited.
Citation
If you find this useful in your research, please consider citing:
@inproceedings{SWQBZZ25,
author = {Xinyue Shen and Yixin Wu and Yiting Qu and Michael Backes and Savvas Zannettou and Yang Zhang},
title = {{HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns}},
booktitle = {{USENIX Security Symposium (USENIX Security)}},
publisher = {USENIX},
year = {2025}
}
Additional details
Software
- Repository URL: https://github.com/TrustAIRLab/HateBench
- Programming language: Python
- Development status: Active