TCAB: Text Classification Attack Benchmark Dataset
Creators
- University of California, Irvine
- University of California, San Diego
- University of Oregon
Description
TCAB is a large collection of successful adversarial attacks on state-of-the-art text classification models trained on datasets from the sentiment and abuse domains.
The dataset is split into two files: train.csv and val.csv. The training set contains 1,448,751 instances (552,364 of which are "clean", i.e., unperturbed) and the validation set contains 482,914 instances (178,607 clean). Each instance contains the following attributes (a loading sketch follows the list):
scenario: Domain, either abuse or sentiment.
target_model_dataset: Dataset being attacked.
target_model_train_dataset: Dataset the target model was trained on.
target_model: Type of victim model (e.g., bert, roberta, xlnet).
attack_toolchain: Open-source attack toolchain, either TextAttack or OpenAttack.
attack_name: Name of the attack method.
original_text: Original input text.
original_output: Prediction probabilities of the target model on the original text.
ground_truth: Encoded label for the domain dataset's original task. For abuse datasets, 1 means toxic and 0 means non-toxic. For sentiment datasets, 1 means positive and 0 means negative; if a neutral class exists, then 2, 1, and 0 mean positive, neutral, and negative, respectively.
status: "clean" for unperturbed examples; "success" for successful adversarial attacks.
perturbed_text: Text after it has been perturbed by an attack.
perturbed_output: Prediction probabilities of the target model on the perturbed text.
attack_time: Time taken to execute the attack.
num_queries: Number of queries performed while attacking.
frac_words_changed: Fraction of words changed due to an attack.
test_index: Index of each unique source example (original instance); legacy field kept for backwards compatibility.
original_text_identifier: Index of each unique source example (original instance).
unique_src_instance_identifier: Primary key that uniquely identifies every source instance; composed of (target_model_dataset, test_index, original_text_identifier).
pk: Primary key that uniquely identifies every attack instance; composed of (attack_name, attack_toolchain, original_text_identifier, scenario, target_model, target_model_dataset, test_index).
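For reference, here is a minimal sketch of loading the two splits and working with the schema above. It assumes pandas and that train.csv and val.csv sit in the working directory; the "_"-joined string form of the composite keys is an illustrative assumption showing only how the component columns combine, not necessarily the dataset's exact key format.

```python
import pandas as pd

# Load the two splits (assumes the CSVs are in the current directory).
train = pd.read_csv("train.csv")
val = pd.read_csv("val.csv")

# Separate clean (unperturbed) instances from successful attacks via `status`.
clean = train[train["status"] == "clean"]
attacked = train[train["status"] == "success"]
print(f"train: {len(train):,} total, {len(clean):,} clean, {len(attacked):,} attacked")

# Example: count successful attacks per toolchain and attack method.
print(attacked.groupby(["attack_toolchain", "attack_name"]).size())

# Rebuild the composite keys from their component columns. The column
# combinations come from the field descriptions above; joining with "_"
# is an assumption made for illustration.
SRC_KEY = ["target_model_dataset", "test_index", "original_text_identifier"]
ATTACK_KEY = ["attack_name", "attack_toolchain", "original_text_identifier",
              "scenario", "target_model", "target_model_dataset", "test_index"]

train["src_key"] = train[SRC_KEY].astype(str).agg("_".join, axis=1)
train["attack_key"] = train[ATTACK_KEY].astype(str).agg("_".join, axis=1)

# Each attack key should identify exactly one row.
assert train["attack_key"].is_unique
```

Grouping rows by src_key recovers every attack launched against a single original example, which is useful for comparing attack methods on the same input.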