Published March 12, 2026 | Version v6

Quantifying Large Language Model Attacks Through the Lens of Model Cognition

Description

Artifact Evaluation: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

Paper ID: #1781 (USENIX Security '26)
Title: Quantifying Large Language Model Attacks Through the Lens of Model Cognition

📖 Overview

This repository contains the artifact for the paper "Quantifying Large Language Model Attacks Through the Lens of Model Cognition" (USENIX Security 2026). It provides all necessary data, code, and scripts to validate our claims and reproduce the experimental results reported in the paper.

We provide two primary modes for evaluation:

  1. Instant Verification (src/quick_start.py): A CLI tool to instantly query and retrieve specific experimental results (Accuracy, AUC, etc.) directly from the pre-computed logs used in the paper.

  2. Claim Reproduction (src/claims/*.ipynb): A set of modular Jupyter Notebooks that execute the actual pipeline—from training probes to evaluating sentinels—allowing for deep inspection and reproduction of specific claims.

📂 Directory Structure

 .
 ├── data/           # Datasets used for training probes and conducting attacks (e.g., fixed.json, adversarial prompts).
 ├── models/         # Directory where LLMs and baseline models will be downloaded.
 ├── results/        # JSON files containing the finalized experimental metrics reported in the paper.
 └── src/            # Source code and executable scripts.
    ├── claims/           # Modular notebooks for verifying individual claims (C1-C4).
    │   ├── claim1.ipynb   # Layer-wise Separability
    │   ├── claim2.ipynb   # Cognitive Drift
    │   ├── claim3.ipynb   # Sentinel Construction
    │   └── claim4.ipynb   # Baseline Comparison
    ├── installation.ipynb # Setup script for environment and model downloading.
    ├── basic_test.py     # Script to verify GPU and dependency status.
    └── quick_start.py     # CLI tool to query results directly from the 'results/' folder.
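
Before running anything, it can help to confirm that the unpacked artifact matches the layout above. The following is a minimal sketch, not part of the artifact: the paths are copied from the listing, while the helper name missing_paths is an illustrative assumption.

```python
from pathlib import Path

# Paths copied from the directory listing above.
EXPECTED = [
    "data",
    "models",
    "results",
    "src/claims/claim1.ipynb",
    "src/claims/claim2.ipynb",
    "src/claims/claim3.ipynb",
    "src/claims/claim4.ipynb",
    "src/installation.ipynb",
    "src/basic_test.py",
    "src/quick_start.py",
]


def missing_paths(root):
    """Return the expected paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]
```

Calling missing_paths(".") from the repository root should return an empty list when the layout is complete. Note that models/ may be empty until the installation notebook has downloaded the models.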
 

💻 Hardware & Performance Reference

The code is optimized to run on standard research hardware. The provided reproduction example uses Qwen3-4B.

  • Recommended GPU: 1x NVIDIA A100 (40GB VRAM) is fully sufficient.

  • Estimated Runtime: Approximately 3 hours for the complete pipeline, taking Qwen3-4B as an example (model download → extraction → training → evaluation).

🚀 Getting Started

1. Installation & Setup

We rely on Conda for environment management. Please follow these steps:

  1. Open and run the src/installation.ipynb notebook.

    • It will create the lac environment (Python 3.10).

    • It will install all dependencies.

    • It will download the required models (Qwen3-4B, Llama-Guard-3-8B, etc.).

  2. Note: Ensure you activate the lac kernel for all subsequent notebooks.

2. Basic Functionality Test

To ensure your environment and GPU are configured correctly before running heavy experiments:

 conda activate lac
 python src/basic_test.py

Expected Output: "Ready for reproduction!" (along with GPU details).
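
For readers who want to see what a check of this kind involves, here is a minimal sketch. It is not the contents of src/basic_test.py. The package list is an assumption for illustration; the authoritative dependency list is whatever src/installation.ipynb installs.

```python
import importlib.util

# ASSUMPTION: illustrative package names, not the artifact's actual dependency list.
ASSUMED_PACKAGES = ["torch", "transformers", "sklearn"]


def check_environment(packages=ASSUMED_PACKAGES):
    """Report which packages are importable and, if PyTorch sees a GPU, its name and memory."""
    status = {name: importlib.util.find_spec(name) is not None for name in packages}
    gpu = None
    if status.get("torch"):
        import torch  # imported lazily so the check still runs without PyTorch

        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            gpu = (props.name, round(props.total_memory / 1024**3, 1))  # (name, GiB)
    return status, gpu
```

A False entry in status points to a missing dependency. A gpu value of None means PyTorch is either not installed, not among the checked packages, or cannot see a CUDA device.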

🧪 Evaluation Modes

Mode A: Instant Result Verification (Experiment E5)

If you wish to quickly verify specific numbers cited in the paper (e.g., Table 2) without running the training pipeline, use the src/quick_start.py script to parse pre-computed logs.

Usage:

 conda activate lac
 python src/quick_start.py --model Qwen3-4B --method Multi-layer --dataset Sneaky

Run python src/quick_start.py --help for full options.
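
To inspect the pre-computed metrics from a notebook instead of the CLI, a lookup along the following lines may be enough. This is a sketch under a loudly stated assumption: the JSON schema (a nested mapping model → method → dataset → metrics) and the helper name load_metrics are hypothetical and may not match the files in results/.

```python
import json
from pathlib import Path


def load_metrics(results_path, model, method, dataset):
    """Look up one (model, method, dataset) entry in a results JSON file.

    ASSUMPTION: the file is a nested mapping model -> method -> dataset -> metrics.
    The artifact's real schema may differ; adapt the indexing accordingly.
    """
    results = json.loads(Path(results_path).read_text(encoding="utf-8"))
    try:
        return results[model][method][dataset]
    except KeyError as missing:
        raise KeyError(
            f"no entry for {missing} under {model}/{method}/{dataset}"
        ) from None
```

The three lookup keys mirror the --model, --method, and --dataset options shown in the usage line above, for example load_metrics(path, "Qwen3-4B", "Multi-layer", "Sneaky") with path pointing at one of the JSON files in results/.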

Mode B: Claim Verification & Reproduction (Experiments E1-E4)

To reproduce the experiments and verify specific major claims, run the corresponding notebooks in the src/claims/ directory. Detailed step-by-step instructions and code explanations are provided directly within each notebook.

Claim (Cx) | Description | Experiment | Notebook
C1 | Layer-wise Separability: Toxic intent is separable (AUC ≥ 0.90) in mid-depth layers. | E1 | src/claims/claim1.ipynb
C2 | Cognitive Drift: Adversarial perturbations cause significant hidden state divergence correlating with attack success. | E2 | src/claims/claim2.ipynb
C3 | Sentinel Effectiveness: Multi-layer sentinel achieves >94% accuracy, outperforming single layers. | E3 | src/claims/claim3.ipynb
C4 | Superiority: Our method outperforms Llama-Guard-3-8B and other baselines on stealthy attacks. | E4 | src/claims/claim4.ipynb
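
As a reference for how the C1 criterion can be read, the sketch below computes AUC from its pairwise definition and keeps the layers that reach the 0.90 threshold. It assumes per-layer probe scores are already available as plain lists. The function names and the toy numbers are illustrative only and are not results from the paper; the notebooks may compute AUC differently (for example, with a library routine).

```python
def roc_auc(labels, scores):
    """AUC as P(positive score > negative score), counting ties as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def separable_layers(scores_by_layer, labels, threshold=0.90):
    """Return {layer: auc} for layers whose probe AUC meets the threshold."""
    aucs = {layer: roc_auc(labels, s) for layer, s in scores_by_layer.items()}
    return {layer: auc for layer, auc in aucs.items() if auc >= threshold}


# Toy data (made up): 1 = toxic prompt, 0 = benign prompt.
labels = [1, 1, 0, 0]
toy_scores = {
    0: [0.6, 0.4, 0.5, 0.3],   # early layer: classes overlap, AUC = 0.75
    18: [0.9, 0.8, 0.2, 0.1],  # mid-depth layer: perfectly ranked, AUC = 1.0
}
print(separable_layers(toy_scores, labels))  # {18: 1.0}
```

The pairwise form is quadratic in the number of examples, which is fine for a sanity check but slow on large evaluation sets.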

⚙️ Customization

The default scripts use Qwen3-4B. To reproduce results for other models mentioned in the paper (e.g., Llama-3.1-8B-Instruct), you need to:

  1. Download Model: Modify model_ids in src/installation.ipynb to download the target model.

  2. Update Model Path: Update the MODEL_NAME variable in the respective src/claims/claim*.ipynb notebooks.

  3. Update Output Path: Change the subfolder_name variable (e.g., to llama_8b) in the notebooks to ensure results are saved in a separate directory.

  4. Update Layer Count: Adjust the layer range loop (e.g., range(0, 36)) to match the total number of hidden layers of the new model (e.g., Qwen3-4B has 36 layers, while Llama-3.1-8B-Instruct has 32 layers).
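
Rather than hard-coding the layer count in step 4, the range can be derived from the model's own configuration. The sketch below assumes the model was downloaded in Hugging Face format, so that its directory contains a config.json with a num_hidden_layers field. The helper name layer_range and the directory argument are illustrative and do not appear in the notebooks.

```python
import json
from pathlib import Path


def layer_range(model_dir):
    """Build the layer loop range from a Hugging Face style config.json.

    ASSUMPTION: `model_dir` holds a config.json with a "num_hidden_layers" key,
    as is the case for Qwen3 and Llama 3.1 checkpoints.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return range(0, config["num_hidden_layers"])
```

With the layer counts quoted above, this would yield range(0, 36) for Qwen3-4B and range(0, 32) for the Llama model, so the same notebook cell works for both once MODEL_NAME is changed.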

🔗 Correspondence with Open Science Policy

In accordance with the USENIX Security 2026 open-science policy, this artifact fulfills the specific commitments made in the Open Science section of our paper:

Open Science Commitment | Corresponding Artifact Component
1. Source Code | src/ Folder: Contains the full codebase. The src/claims/ notebooks provide the probing framework implementation.
2. Data Access | data/ Folder: Contains scripts and pre-processed files for training sets (NSFW-56k/GPT-4o) and benchmarks (I2P, Sneaky, MMA, Labelled).
3. Probe Training | src/claims/claim1.ipynb: Contains the exact training logic and hyperparameters to train probes from scratch locally.
4. Hidden States | src/claims/claim1.ipynb: Demonstrates on-the-fly extraction from local LLMs, verifying no reliance on cached proprietary tensors.
5. Reproducibility | src/quick_start.py & src/claims/: Allows both instant verification of Table 2 data and full regeneration of results (Figure 4, Figure 5, Table 2).

Files

_Security_26_AE__Quantifying_Large_Language_Model_Attacks_Through_the_Lens_of_Model_Cognition.pdf

Additional details

Dates

Available
2025-12-10

Software