When LLMs Jailbreak Themselves: Reflexive Identity Bypass in Agentic Systems

Chadha, Ankush

doi:10.5281/zenodo.20573747

Published June 6, 2026 | Version v3

Preprint Open

When LLMs Jailbreak Themselves: Reflexive Identity Bypass in Agentic Systems

Chadha, Ankush¹

1. Independent Researcher

Corrigendum No. 2 (June 2026): This version substantially corrects the central claim of the original paper. Controlled, cross-model experiments (Claude Haiku 4.5 and Gemini 2.5 Flash) show the effect is not a novel "reflexive identity" attack. The self-referential nature of the trigger is not load-bearing: a neutral article that merely contains the off-topic answer reproduces the same behavior. The phenomenon reduces to a known class, deployer-scope defeat (capability leak), driven by an anti-refusal instruction in the system prompt; removing only that instruction drops the effect to 0%. The term "Reflexive Identity Bypass" and the novel-attack-class framing are withdrawn. The harmful-content control probe stayed refused, but this was not a safety evaluation and no jailbreak resistance was tested. See the attached "Corrigendum No. 2" PDF for data and methods.

Corrigendum note (v2, June 2026): this version corrects the mechanism, severity, and amplification claims of v1 — see the attached corrigendum for detail. In brief: the bypass is driven by a standing anti-refusal instruction (not identity self-rationalization); it defeats deployer-configured scope only, with base-model safety intact; and the shared-memory amplification is hypothesized, not experimentally validated.

Most LLM agent jailbreaks require adversarial content — prompt injection, persona attacks, encoded instructions. This work presents reflexive identity bypass, an attack class that requires none of that: showing an LLM agent with tool access a benign, accurate, non-adversarial article about itself causes it to abandon the operational scope its deployer configured, answering off-scope requests it otherwise declines. Demonstrated on Docker's Gordon AI (filesystem, shell, and Docker daemon access) and replicated in a second agent. The cause is isolated by ablation to a single system-prompt construct. The work differentiates the attack from indirect prompt injection and persona-based jailbreaks and discusses configuration- and output-layer mitigations.

Files

chadha-rib-2026.pdf

Files (478.1 kB)

Name	Size	Download all
chadha-rib-2026.pdf md5:c7242276d29c14710110a1c00967bf63	431.3 kB	Preview Download
corrigendum-2-rib-2026-06.pdf md5:506cbde0952a3fc12f465ae819543f3b	41.4 kB	Preview Download
corrigendum-rib-2026-06.pdf md5:b8db12d046d140e8bef78b8f659bf7df	5.5 kB	Preview Download

Additional details

Subtitle (English): Corrigendum No. 2 (June 2026)

Is supplemented by: Software: https://github.com/ankushchadha/reflexive-identity-bypass (URL)
References: Other: https://github.com/docker/desktop-feedback/issues/370 (URL)

Other: 2026-05-12

First public disclosure (GitHub bug report)

	All versions	This version
Views	153	48
Downloads	81	24
Data volume	44.6 MB	9.6 MB

chadha-rib-2026.pdf

Files (478.1 kB)

Additional titles

Related works

Dates

When LLMs Jailbreak Themselves: Reflexive Identity Bypass in Agentic Systems

Authors/Creators

Description

Files

chadha-rib-2026.pdf

Files (478.1 kB)

Additional details

Additional titles

Related works

Dates