Quantifying Large Language Model Attacks Through the Lens of Model Cognition
Artifact Evaluation: Quantifying Large Language Model Attacks Through the Lens of Model Cognition
Paper ID: #1781 (USENIX Security '26) Title: Quantifying Large Language Model Attacks Through the Lens of Model Cognition
📖 Overview
This repository contains the artifact for the paper "Quantifying Large Language Model Attacks Through the Lens of Model Cognition" (USENIX Security 2026). It provides all necessary data, code, and scripts to validate our claims and reproduce the experimental results reported in the paper.
We provide two primary modes for evaluation:
- Instant Verification (src/quick_start.py): A CLI tool to instantly query and retrieve specific experimental results (Accuracy, AUC, etc.) directly from the pre-computed logs used in the paper.
- Claim Reproduction (src/claims/*.ipynb): A set of modular Jupyter Notebooks that execute the actual pipeline, from training probes to evaluating sentinels, allowing for deep inspection and reproduction of specific claims.
📂 Directory Structure
.
├── data/ # Datasets used for training probes and conducting attacks (e.g., fixed.json, adversarial prompts).
├── models/ # Directory where LLMs and baseline models will be downloaded.
├── results/ # JSON files containing the finalized experimental metrics reported in the paper.
└── src/ # Source code and executable scripts.
    ├── claims/ # Modular notebooks for verifying individual claims (C1-C4).
    │   ├── claim1.ipynb # Layer-wise Separability
    │   ├── claim2.ipynb # Cognitive Drift
    │   ├── claim3.ipynb # Sentinel Construction
    │   └── claim4.ipynb # Baseline Comparison
    ├── installation.ipynb # Setup script for environment and model downloading.
    ├── basic_test.py # Script to verify GPU and dependency status.
    └── quick_start.py # CLI tool to query results directly from the 'results/' folder.
💻 Hardware & Performance Reference
The code is optimized to run on standard research hardware. The provided reproduction example uses Qwen3-4B.
- Recommended GPU: 1x NVIDIA A100 (40GB VRAM) is sufficient.
- Estimated Runtime: Approximately 3 hours for the complete pipeline, taking Qwen3-4B as an example (Model download → Extraction → Training → Evaluation).
🚀 Getting Started
1. Installation & Setup
We rely on Conda for environment management. Please follow these steps:
- Open and run the src/installation.ipynb notebook.
  - It will create the lac environment (Python 3.10).
  - It will install all dependencies.
  - It will download the required models (Qwen3-4B, Llama-Guard-3-8B, etc.).
- Note: Ensure you activate the lac kernel for all subsequent notebooks.
2. Basic Functionality Test
To ensure your environment and GPU are configured correctly before running heavy experiments:
conda activate lac
python src/basic_test.py
Expected Output: "Ready for reproduction!" (along with GPU details).
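The contents of src/basic_test.py are not reproduced here. As a rough illustration of what a dependency and GPU check of this kind can look like, the sketch below uses only the standard library plus an optional PyTorch import. The function name `check_environment` and the default package list are assumptions made for this example, not the artifact's actual code.

```python
import importlib.util
import platform


def check_environment(required=("torch", "transformers")):
    """Report which required packages are importable and whether a GPU is visible.

    The default package list is an assumption for illustration only.
    """
    report = {"python": platform.python_version(), "missing": [], "gpu": None}
    for name in required:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is None:
            report["missing"].append(name)
    if "torch" in required and "torch" not in report["missing"]:
        import torch  # imported lazily so the check still runs without PyTorch

        if torch.cuda.is_available():
            report["gpu"] = torch.cuda.get_device_name(0)
    # "Ready" here means: nothing missing and a CUDA device was found.
    report["ready"] = not report["missing"] and report["gpu"] is not None
    return report


if __name__ == "__main__":
    print(check_environment())
```

If the real script fails, a check like this can help narrow down whether the problem is a missing package or an invisible GPU.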
🧪 Evaluation Modes
Mode A: Instant Result Verification (Experiment E5)
If you wish to quickly verify specific numbers cited in the paper (e.g., Table 2) without running the training pipeline, use the src/quick_start.py script to parse pre-computed logs.
Usage:
conda activate lac
python src/quick_start.py --model Qwen3-4B --method Multi-layer --dataset Sneaky
Run python src/quick_start.py --help for full options.
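For readers who prefer to inspect the pre-computed metrics programmatically, a lookup of this kind can be sketched as below. The JSON layout is an assumption: the sketch supposes each file in results/ holds a list of flat records with `model`, `method`, and `dataset` keys, which may not match the artifact's real schema. The helper name `find_results` is hypothetical.

```python
import json
from pathlib import Path


def find_results(results_dir, model, method, dataset):
    """Return every record in results_dir/*.json that matches all three filters.

    Assumed (hypothetical) layout: each JSON file contains a list of flat
    dicts with "model", "method" and "dataset" keys plus metric fields.
    """
    wanted = (model, method, dataset)
    matches = []
    for path in sorted(Path(results_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        for rec in records:
            key = (rec.get("model"), rec.get("method"), rec.get("dataset"))
            if key == wanted:
                matches.append(rec)
    return matches


if __name__ == "__main__":
    for rec in find_results("results", "Qwen3-4B", "Multi-layer", "Sneaky"):
        print(rec)
```

If the actual files are nested differently, only the inner loop needs to change; the filtering idea stays the same.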
Mode B: Claim Verification & Reproduction (Experiments E1-E4)
To reproduce the experiments and verify specific major claims, run the corresponding notebooks in the src/claims/ directory. Detailed step-by-step instructions and code explanations are provided directly within each notebook.
| Claim (Cx) | Description | Experiment | Notebook |
|---|---|---|---|
| C1 | Layer-wise Separability: Toxic intent is separable (AUC ≥ 0.90) in mid-depth layers. | E1 | src/claims/claim1.ipynb |
| C2 | Cognitive Drift: Adversarial perturbations cause significant hidden state divergence correlating with attack success. | E2 | src/claims/claim2.ipynb |
| C3 | Sentinel Effectiveness: Multi-layer sentinel achieves >94% accuracy, outperforming single layers. | E3 | src/claims/claim3.ipynb |
| C4 | Superiority: Our method outperforms Llama-Guard-3-8B and other baselines on stealthy attacks. | E4 | src/claims/claim4.ipynb |
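To make the C1 criterion concrete, the sketch below computes AUC from probe scores and lists the layers that reach the 0.90 threshold. It is a dependency-free illustration, not the notebooks' evaluation code (which may well use a library such as scikit-learn). The function names and the `per_layer_scores` dictionary, mapping a layer index to that layer's probe scores, are assumptions made for this example.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def separable_layers(per_layer_scores, labels, threshold=0.90):
    """Return the layer indices whose probe scores reach the AUC threshold."""
    return [
        layer
        for layer, scores in sorted(per_layer_scores.items())
        if roc_auc(labels, scores) >= threshold
    ]
```

The pairwise formulation is quadratic in the number of examples, which is fine for a sanity check but slower than a rank-based implementation on large evaluation sets.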
⚙️ Customization
The default scripts use Qwen3-4B. To reproduce results for other models mentioned in the paper (e.g., Llama-3.1-8B-Instruct), you need to:
- Download Model: Modify model_ids in src/installation.ipynb to download the target model.
- Update Model Path: Update the MODEL_NAME variable in the respective src/claims/claim*.ipynb notebooks.
- Update Output Path: Change the subfolder_name variable (e.g., to llama_8b) in the notebooks to ensure results are saved in a separate directory.
- Update Layer Count: Adjust the layer range loop (e.g., range(0, 36)) to match the total number of hidden layers of the new model (e.g., Qwen3-4B has 36 layers, while Llama-3.1-8B-Instruct has 32 layers).
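Rather than hard-coding the layer count, the loop bound can be read from the model's own configuration. The sketch below assumes the downloaded model directory is in Hugging Face format and therefore contains a config.json with a `num_hidden_layers` field. The helper name `layer_range` and the example path under models/ are hypothetical.

```python
import json
from pathlib import Path


def layer_range(model_dir):
    """Build the layer loop range from the model's own config.json.

    Assumes a Hugging Face style model directory; the path layout under
    models/ is not specified by this README.
    """
    config_path = Path(model_dir) / "config.json"
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return range(0, config["num_hidden_layers"])


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever installation.ipynb stored the model.
    for layer in layer_range("models/Qwen3-4B"):
        print(layer)
```

Deriving the range this way removes one manual edit when switching models and avoids silently probing too few or too many layers.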
🔗 Correspondence with Open Science Policy
In accordance with the USENIX Security 2026 open-science policy, this artifact fulfills the specific commitments made in the Open Science section of our paper:
| Open Science Commitment | Corresponding Artifact Component |
|---|---|
| 1. Source Code | src/ Folder: Contains the full codebase. The src/claims/ notebooks provide the probing framework implementation. |
| 2. Data Access | data/ Folder: Contains scripts and pre-processed files for training sets (NSFW-56k/GPT-4o) and benchmarks (I2P, Sneaky, MMA, Labelled). |
| 3. Probe Training | src/claims/claim1.ipynb: Contains the exact training logic and hyperparameters to train probes from scratch locally. |
| 4. Hidden States | src/claims/claim1.ipynb: Demonstrates on-the-fly extraction from local LLMs, verifying no reliance on cached proprietary tensors. |
| 5. Reproducibility | src/quick_start.py & src/claims/: Allows both instant verification of Table 2 data and full regeneration of results (Figure 4, Figure 5, Table 2). |
Files
_Security_26_AE__Quantifying_Large_Language_Model_Attacks_Through_the_Lens_of_Model_Cognition.pdf
Additional details
Dates
- Available: 2025-12-10
Software
- Repository URL: https://github.com/lxmliu2002/LLM-Attack-Cognition-AE
- Programming language: Python