OSS Broken Card: An Architectural Analysis of the Policy Injection Vulnerability in GPT-OSS Models
Description
This report provides a comprehensive technical analysis of a critical architectural vulnerability identified in the gpt-oss-120b and gpt-oss-20b open-weight models. Despite the extensive safety framework detailed in the official model card, which outlines a sophisticated instruction hierarchy and alignment with OpenAI's usage policies, this analysis documents a novel, high-success-rate jailbreak technique termed "Policy Injection".
This one-shot attack vector exploits the model's core instruction-following logic, turning the very mechanisms intended for safety into a conduit for compromise. The technique injects a counterfeit, high-priority "Unfiltered Policy" into the model's context, which the model is architecturally compelled to obey over its default safety alignment. Experimental validation demonstrates that this method consistently bypasses all safety guardrails, enabling the generation of explicitly harmful content, including detailed instructions for violent criminal acts, in a single prompt. The primary contribution of this work is the identification and deconstruction of a new class of vulnerability in which foundational safety features are themselves turned into reliable and potent attack vectors. This finding poses a significant and immediate risk to the open-weight AI ecosystem and challenges current paradigms of AI safety evaluation and implementation.
Files
| Name | Size | MD5 |
|---|---|---|
| OSS broken card.pdf | 521.7 kB | d2a49adf8a23edc0fdafa9ed5e1843b8 |
Additional details
Related works
- Cites: arXiv:2508.10925 (arXiv)
References
- Prompt Injection attack against LLM-integrated Applications, accessed October 20, 2025, https://arxiv.org/pdf/2306.05499
- Can Indirect Prompt Injection Attacks Be Detected and ... - arXiv, accessed October 20, 2025, https://arxiv.org/abs/2502.16580
- Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment - arXiv, accessed October 20, 2025, https://arxiv.org/html/2410.14827v3
- Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy, accessed October 20, 2025, https://arxiv.org/html/2508.04281v1
- Defense Against Prompt Injection Attack by Leveraging ... - arXiv, accessed October 20, 2025, https://arxiv.org/abs/2411.00459
- Agent-based AI systems face growing threats from zero-click and ..., accessed October 20, 2025, https://the-decoder.com/agent-based-ai-systems-face-growing-threats-from-zero-click-and-one-click-exploits/
- GPT OSS system card, https://arxiv.org/pdf/2508.10925
- Policy injection (aka N3w P0l!cy), https://github.com/SlowLow999/UltraBr3aks/blob/main/N3w_P0l!cy.mkd
- Extracted GPT OSS policy block, https://github.com/SlowLow999/UltraBr3aks/blob/main/N3w_P0l!cy.mkd
- Skeleton key: the base idea of policy injection | Microsoft Security, June 26, 2024, https://www.microsoft.com/en-us/security/blog/2024/06/26/mitigating-skeleton-key-a-new-type-of-generative-ai-jailbreak-technique/
- Red Teaming for Large Language Models: A Comprehensive Guide, accessed October 20, 2025, https://coralogix.com/ai-blog/red-teaming-for-large-language-models-a-comprehensive-guide/