Published February 12, 2026 | Version 4.1
Preprint | Open Access

The Integrity Gap: Detection Without Enforcement in Large Language Models

Authors/Creators

Description

Large language models exhibit a systematic gap between their capacity to detect harmful content and their default behavior when asked to produce it. We document this "Integrity Gap" across eight models from eight organizations (Anthropic, OpenAI, xAI, DeepSeek, Alibaba, Meta, Moonshot, Arcee AI), tested via four API providers. Under baseline conditions, every model reproduced a prompt injection payload; under governance framing, every model blocked it. Statistical validation at n = 30 on two models (DeepSeek-V3.1 and Claude Sonnet 4) yields p < 10⁻³⁰. The effect requires no retraining. The deposit includes all test scripts, raw API logs, and reproducibility materials.
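The abstract reports p < 10⁻³⁰ at n = 30 per condition but does not name the test, so the following is a sketch under an assumption, not the authors' method: a one-sided Fisher exact test on one model's blocked-vs-reproduced counts. With a perfect 0/30 vs 30/30 split this gives p ≈ 8.5 × 10⁻¹⁸ for a single model; the stronger figure in the paper presumably reflects its own test choice or combined evidence across models. The function name and table layout below are illustrative.

```python
# A minimal sketch, assuming a one-sided Fisher exact test; the paper's
# released test scripts may compute this differently.
from math import comb

def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for the 2x2 table
        [[a, b],    # baseline:   blocked, reproduced
         [c, d]]    # governance: blocked, reproduced
    i.e. P(governance blocks >= c), holding the margins fixed."""
    N = a + b + c + d          # total trials across both conditions
    K = a + c                  # total blocked outcomes
    n = c + d                  # governance-condition trials
    p = 0.0
    for k in range(c, min(n, K) + 1):
        # hypergeometric probability that k of the blocked outcomes
        # fall in the governance condition
        p += comb(K, k) * comb(N - K, n - k) / comb(N, n)
    return p

# Perfect split reported per model: baseline reproduces the payload
# 30/30 (0 blocked), governance blocks 30/30.
print(f"p = {fisher_exact_greater(a=0, b=30, c=30, d=0):.2e}")  # ~8.46e-18
```

An exact test is the natural choice here because every cell sits at its boundary value, where normal-approximation tests (chi-squared, z-test for proportions) are unreliable.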

Files (245.1 kB)

Name                            Size       MD5
The_Integrity_Gap_v4.1 (1).pdf  221.7 kB   17fd814aac4342c2219f28ac4ea5b6ad
(name not shown)                23.4 kB    391cd33e186d1ed1dcf92c6722d912f1

Additional details

Dates

Submitted: 2026-02-12