When LLMs Jailbreak Themselves: Reflexive Identity Bypass in Agentic Systems
Description
Corrigendum No. 2 (June 2026): This version substantially corrects the central claim of the original paper. Controlled, cross-model experiments (Claude Haiku 4.5 and Gemini 2.5 Flash) show the effect is not a novel "reflexive identity" attack. The self-referential nature of the trigger is not load-bearing: a neutral article that merely contains the off-topic answer reproduces the same behavior. The phenomenon reduces to a known class, deployer-scope defeat (capability leak), driven by an anti-refusal instruction in the system prompt; removing only that instruction drops the effect to 0%. The term "Reflexive Identity Bypass" and the novel-attack-class framing are withdrawn. The harmful-content control probe stayed refused, but this was not a safety evaluation and no jailbreak resistance was tested. See the attached "Corrigendum No. 2" PDF for data and methods.
Corrigendum note (v2, June 2026): this version corrects the mechanism, severity, and amplification claims of v1 — see the attached corrigendum for detail. In brief: the bypass is driven by a standing anti-refusal instruction (not identity self-rationalization); it defeats deployer-configured scope only, with base-model safety intact; and the shared-memory amplification is hypothesized, not experimentally validated.
Most LLM agent jailbreaks require adversarial content — prompt injection, persona attacks, encoded instructions. This work presents reflexive identity bypass, an attack class that requires none of that: showing an LLM agent with tool access a benign, accurate, non-adversarial article about itself causes it to abandon the operational scope its deployer configured, answering off-scope requests it otherwise declines. Demonstrated on Docker's Gordon AI (filesystem, shell, and Docker daemon access) and replicated in a second agent. The cause is isolated by ablation to a single system-prompt construct. The work differentiates the attack from indirect prompt injection and persona-based jailbreaks and discusses configuration- and output-layer mitigations.
Files
chadha-rib-2026.pdf
Additional details
Additional titles
- Subtitle (English)
- Corrigendum No. 2 (June 2026)
Related works
- Is supplemented by
- Software: https://github.com/ankushchadha/reflexive-identity-bypass (URL)
- References
- Other: https://github.com/docker/desktop-feedback/issues/370 (URL)
Dates
- Other
-
2026-05-12First public disclosure (GitHub bug report)