Quantifying Large Language Model Attacks Through the Lens of Model Cognition

Xiuming, Liu; Chaoxiang, He; Xuanran, Yu; Jichen, Chai; Feiyue, Xu; Sheng, Hang; Hanqing, Hu; Bin Benjamin, Zhu; Hongsheng, Hu; Shi-Feng, Sun; Dawu, Gu; Shuo, Wang

doi:10.5281/zenodo.18980487

Published March 12, 2026 | Version v6

Conference paper Open

1. Shanghai Jiao Tong University
2. Microsoft Research Asia (China)

Artifact Evaluation: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

Paper ID: #1781 (USENIX Security '26)
Title: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

📖 Overview

This repository contains the artifact for the paper "Quantifying Large Language Model Attacks Through the Lens of Model Cognition" (USENIX Security 2026). It provides all necessary data, code, and scripts to validate our claims and reproduce the experimental results reported in the paper.

We provide two primary modes for evaluation:

- Instant Verification (src/quick_start.py): A CLI tool that retrieves specific experimental results (Accuracy, AUC, etc.) directly from the pre-computed logs used in the paper.
- Claim Reproduction (src/claims/*.ipynb): A set of modular Jupyter notebooks that run the actual pipeline, from training probes to evaluating sentinels, so that individual claims can be inspected and reproduced.

📂 Directory Structure

 .
 ├── data/           # Datasets used for training probes and conducting attacks (e.g., fixed.json, adversarial prompts).
 ├── models/         # Directory where LLMs and baseline models will be downloaded.
 ├── results/        # JSON files containing the finalized experimental metrics reported in the paper.
 └── src/            # Source code and executable scripts.
     ├── claims/            # Modular notebooks for verifying individual claims (C1-C4).
     │   ├── claim1.ipynb   # Layer-wise Separability
     │   ├── claim2.ipynb   # Cognitive Drift
     │   ├── claim3.ipynb   # Sentinel Construction
     │   └── claim4.ipynb   # Baseline Comparison
     ├── installation.ipynb # Setup script for environment and model downloading.
     ├── basic_test.py      # Script to verify GPU and dependency status.
     └── quick_start.py     # CLI tool to query results directly from the 'results/' folder.

💻 Hardware & Performance Reference

The code is optimized to run on standard research hardware. The provided reproduction example uses Qwen3-4B.

Recommended GPU: 1x NVIDIA A100 (40GB VRAM) is fully sufficient.
Estimated Runtime: Approximately 3 hours for the complete pipeline with Qwen3-4B (model download → extraction → training → evaluation).

🚀 Getting Started

1. Installation & Setup

We rely on Conda for environment management. Please follow these steps:

Open and run the src/installation.ipynb notebook.
- It will create the lac environment (Python 3.10).
- It will install all dependencies.
- It will download the required models (Qwen3-4B, Llama-Guard-3-8B, etc.).
Note: Ensure you activate the lac kernel for all subsequent notebooks.

2. Basic Functionality Test

To ensure your environment and GPU are configured correctly before running heavy experiments:

 conda activate lac
 python src/basic_test.py

Expected Output: "Ready for reproduction!" (along with GPU details).
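
For readers who want to see what such a pre-flight check involves, the following is a minimal sketch and not the contents of src/basic_test.py. The function name check_environment and the default package list are assumptions made for illustration; the actual dependency set is whatever src/installation.ipynb installs.

```python
import importlib.util
import platform


def check_environment(packages=("torch", "transformers", "sklearn")):
    """Report the Python version, which packages are importable, and the GPU name.

    The default package list is an assumption for this sketch, not the
    artifact's actual dependency list.
    """
    report = {"python": platform.python_version(), "packages": {}, "gpu": None}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report["packages"][name] = importlib.util.find_spec(name) is not None
    if report["packages"].get("torch"):
        import torch  # imported lazily so the check still runs without PyTorch

        if torch.cuda.is_available():
            report["gpu"] = torch.cuda.get_device_name(0)
    return report
```

Calling check_environment() inside the lac environment returns a dictionary that can be printed and inspected; a missing package or a None value for "gpu" points to a setup problem to fix before running the notebooks.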

🧪 Evaluation Modes

Mode A: Instant Result Verification (Experiment E5)

If you wish to quickly verify specific numbers cited in the paper (e.g., Table 2) without running the training pipeline, use the src/quick_start.py script to parse pre-computed logs.

Usage:

 conda activate lac
 python src/quick_start.py --model Qwen3-4B --method Multi-layer --dataset Sneaky

Run python src/quick_start.py --help for the full list of options.
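
The lookup performed in this mode can be pictured with the sketch below. It is not the source of src/quick_start.py. The --model, --method, and --dataset flags come from the usage line above; the --results flag, the file name results/metrics.json, and the nested model → method → dataset layout are all assumptions made for illustration.

```python
import argparse
import json
from pathlib import Path


def lookup(results, model, method, dataset):
    """Return the metrics stored under results[model][method][dataset].

    The nested layout is an assumption for this sketch; the real files in
    results/ may be organized differently.
    """
    try:
        return results[model][method][dataset]
    except KeyError as missing:
        raise KeyError(f"no entry for {missing} in results") from None


def build_parser():
    parser = argparse.ArgumentParser(description="Query pre-computed metrics.")
    parser.add_argument("--model", required=True)
    parser.add_argument("--method", required=True)
    parser.add_argument("--dataset", required=True)
    # Hypothetical flag and file name: a JSON file holding the nested metrics.
    parser.add_argument("--results", type=Path, default=Path("results/metrics.json"))
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    results = json.loads(args.results.read_text(encoding="utf-8"))
    metrics = lookup(results, args.model, args.method, args.dataset)
    print(json.dumps(metrics, indent=2))
```

Keeping the lookup separate from argument parsing makes it easy to reuse the same query from a notebook cell.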

Mode B: Claim Verification & Reproduction (Experiments E1-E4)

To reproduce the experiments and verify specific major claims, run the corresponding notebooks in the src/claims/ directory. Detailed step-by-step instructions and code explanations are provided directly within each notebook.

Claim (Cx)	Description	Experiment	Notebook
C1	Layer-wise Separability: Toxic intent is separable (AUC ≥ 0.90) in mid-depth layers.	E1	`src/claims/claim1.ipynb`
C2	Cognitive Drift: Adversarial perturbations cause significant hidden state divergence correlating with attack success.	E2	`src/claims/claim2.ipynb`
C3	Sentinel Effectiveness: Multi-layer sentinel achieves >94% accuracy, outperforming single layers.	E3	`src/claims/claim3.ipynb`
C4	Superiority: Our method outperforms Llama-Guard-3-8B and other baselines on stealthy attacks.	E4	`src/claims/claim4.ipynb`
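
To make the separability criterion in C1 concrete, the sketch below computes a per-layer AUC from probe scores and lists the layers that reach the 0.90 threshold. It is a dependency-free illustration, not the notebook's implementation. The function names and the scores_by_layer input format are assumptions, and any numbers passed to it in a quick trial are toy values, not results from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: sequence of 0/1 (1 = toxic); scores: probe outputs where a
    higher value means "more toxic". A tie between a positive and a
    negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def separable_layers(labels, scores_by_layer, threshold=0.90):
    """Return the layer indices whose probe AUC meets the threshold."""
    return [
        layer
        for layer, scores in sorted(scores_by_layer.items())
        if roc_auc(labels, scores) >= threshold
    ]
```

The pairwise comparison is quadratic in the number of examples, which is fine for a sanity check; for full-size evaluation sets a rank-based library routine is the more practical choice.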

⚙️ Customization

The default scripts use Qwen3-4B. To reproduce results for other models mentioned in the paper (e.g., Llama-3.1-8B-Instruct), make the following changes:

- Download Model: Modify model_ids in src/installation.ipynb to download the target model.
- Update Model Path: Update the MODEL_NAME variable in the relevant src/claims/claim*.ipynb notebooks.
- Update Output Path: Change the subfolder_name variable (e.g., to llama_8b) in the notebooks so that results are saved in a separate directory.
- Update Layer Count: Adjust the layer range loop (e.g., range(0, 36)) to match the number of hidden layers in the new model (Qwen3-4B has 36 layers; Llama-3.1-8B-Instruct has 32).
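
One way to avoid hard-coding the layer count is to read it from the checkpoint's configuration. The sketch below assumes the model was downloaded in Hugging Face format, so that a config.json with a num_hidden_layers field sits in the model directory. The helper name layer_range and the example path models/Qwen3-4B are hypothetical and not part of the artifact.

```python
import json
from pathlib import Path


def layer_range(model_dir):
    """Build the layer loop range from a local Hugging Face checkpoint.

    Reads num_hidden_layers from config.json inside model_dir (an assumed
    layout) and returns range(0, num_hidden_layers).
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return range(0, config["num_hidden_layers"])
```

In a notebook this could replace a literal such as range(0, 36) with layer_range("models/Qwen3-4B"). Whether the notebooks also treat the embedding output as a separate index is not stated here, so check the loop in each notebook before substituting.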

🔗 Correspondence with Open Science Policy

In accordance with the USENIX Security 2026 open-science policy, this artifact fulfills the specific commitments made in the Open Science section of our paper:

Open Science Commitment	Corresponding Artifact Component
1. Source Code	`src/` Folder: Contains the full codebase. The `src/claims/` notebooks provide the probing framework implementation.
2. Data Access	`data/` Folder: Contains scripts and pre-processed files for training sets (NSFW-56k/GPT-4o) and benchmarks (I2P, Sneaky, MMA, Labelled).
3. Probe Training	`src/claims/claim1.ipynb`: Contains the exact training logic and hyperparameters to train probes from scratch locally.
4. Hidden States	`src/claims/claim1.ipynb`: Demonstrates on-the-fly extraction from local LLMs, verifying no reliance on cached proprietary tensors.
5. Reproducibility	`src/quick_start.py` & `src/claims/`: Allows both instant verification of Table 2 data and full regeneration of results (Figure 4, Figure 5, Table 2).

Files

Files (1.3 MB)

Name	MD5	Size
_Security_26_AE__Quantifying_Large_Language_Model_Attacks_Through_the_Lens_of_Model_Cognition.pdf	c03a47ae53a8abe3c8ef0caef51a52c6	150.0 kB
LLM-Attack-Cognition-AE-main.zip	f341e1b5b10a43b1f802db25336ed0fa	1.1 MB
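
The MD5 checksums listed for the two files can be used to confirm that a download is intact. The following is a small sketch using only the Python standard library; the helper names md5_of and verify are illustrative, and the files are assumed to have been saved under their listed names.

```python
import hashlib

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "_Security_26_AE__Quantifying_Large_Language_Model_Attacks_Through_the_Lens_of_Model_Cognition.pdf":
        "c03a47ae53a8abe3c8ef0caef51a52c6",
    "LLM-Attack-Cognition-AE-main.zip": "f341e1b5b10a43b1f802db25336ed0fa",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of(path) == expected_md5.lower()
```

For example, verify("LLM-Attack-Cognition-AE-main.zip", EXPECTED_MD5["LLM-Attack-Cognition-AE-main.zip"]) should return True for an uncorrupted copy of the archive.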

Additional details

Available: 2025-12-10

Repository URL: https://github.com/lxmliu2002/LLM-Attack-Cognition-AE
Programming language: Python

