The Integrity Gap: Detection Without Enforcement in Large Language Models

Kawa, Steven

doi:10.5281/zenodo.18623288

Published February 12, 2026 | Version 4.1

Preprint Open

The Integrity Gap: Detection Without Enforcement in Large Language Models

Kawa, Steven (Researcher)

Large language models exhibit a systematic gap between their capacity to detect harmful content and their default behavior when asked to produce it. We document this "Integrity Gap" across eight models from eight organizations (Anthropic, OpenAI, xAI, DeepSeek, Alibaba, Meta, Moonshot, Arcee AI), tested via four API providers. Under baseline conditions, every model reproduced a prompt injection payload. Under governance framing, every model blocked it. Statistical validation at n=30 on two models (DeepSeek-V3.1 and Claude Sonnet 4) yields p < 10⁻³⁰. No retraining required. Includes all test scripts, raw API logs, and reproducibility materials.

Files

The_Integrity_Gap_v4.1 (1).pdf

Files (245.1 kB)

Name	Size	Download all
The_Integrity_Gap_v4.1 (1).docx md5:391cd33e186d1ed1dcf92c6722d912f1	23.4 kB	Download
The_Integrity_Gap_v4.1 (1).pdf md5:17fd814aac4342c2219f28ac4ea5b6ad	221.7 kB	Preview Download

Additional details

Submitted: 2026-02-12

The Integrity Gap

Views

Downloads

Show more details

	All versions	This version
Views	31	31
Downloads	10	10
Data volume	2.9 MB	2.9 MB

More info on how stats are collected....

DOI

Resource type

Preprint

Publisher

Zenodo

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited. Read more

Technical metadata

Created: February 12, 2026
Modified: February 13, 2026

The Integrity Gap: Detection Without Enforcement in Large Language Models

Authors/Creators

Description

Files

The_Integrity_Gap_v4.1 (1).pdf

Files (245.1 kB)

Additional details

Dates