The Integrity Gap: Detection Without Enforcement in Large Language Models
Authors/Creators
Description
Large language models exhibit a systematic gap between their capacity to detect harmful content and their default behavior when asked to produce it. We document this "Integrity Gap" across eight models from eight organizations (Anthropic, OpenAI, xAI, DeepSeek, Alibaba, Meta, Moonshot, Arcee AI), tested via four API providers. Under baseline conditions, every model reproduced a prompt injection payload. Under governance framing, every model blocked it. Statistical validation at n=30 on two models (DeepSeek-V3.1 and Claude Sonnet 4) yields p < 10⁻³⁰. No retraining required. Includes all test scripts, raw API logs, and reproducibility materials.
Files
The_Integrity_Gap_v4.1 (1).pdf
Files
(245.1 kB)
| Name | Size | Download all |
|---|---|---|
|
md5:391cd33e186d1ed1dcf92c6722d912f1
|
23.4 kB | Download |
|
md5:17fd814aac4342c2219f28ac4ea5b6ad
|
221.7 kB | Preview Download |
Additional details
Dates
- Submitted
-
2026-02-12The Integrity Gap