OSS Broken Card: An Architectural Analysis of the Policy Injection Vulnerability in GPT-OSS Models

Elouali, Ahmed Rayane (Researcher)

doi:10.5281/zenodo.17428387

Published October 23, 2025 | Version 1.0

Preprint | Open access

This report provides a comprehensive technical analysis of a critical architectural vulnerability identified in the gpt-oss-120b and gpt-oss-20b open-weight models. Despite the extensive safety framework detailed in the official model card, which outlines a sophisticated instruction hierarchy and alignment with OpenAI's usage policies, this analysis documents a novel, high-success-rate jailbreak technique termed "Policy Injection".

This one-shot attack exploits the model's core instruction-following logic, using the very mechanisms intended for safety as a conduit for compromise. The technique injects a counterfeit, high-priority "Unfiltered Policy" into the model's context, which the model is architecturally compelled to obey over its default safety alignment. Experimental validation demonstrates that this method consistently bypasses all safety guardrails, enabling the generation of explicitly harmful content, including detailed instructions for violent criminal acts, in a single prompt. The primary contribution of this work is the identification and deconstruction of a new class of vulnerability in which foundational safety features are themselves turned into reliable and potent attack vectors. This finding poses a significant and immediate risk to the open-weight AI ecosystem and challenges current paradigms of AI safety evaluation and implementation.

Files (521.7 kB)

Name	Checksum	Size
OSS broken card.pdf	md5:d2a49adf8a23edc0fdafa9ed5e1843b8	521.7 kB

Additional details

Related works

Cites: Other: arXiv:2508.10925 (arXiv)

References

Prompt Injection attack against LLM-integrated Applications, accessed October 20, 2025, https://arxiv.org/pdf/2306.05499
Can Indirect Prompt Injection Attacks Be Detected and ... - arXiv, accessed October 20, 2025, https://arxiv.org/abs/2502.16580
Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment - arXiv, accessed October 20, 2025, https://arxiv.org/html/2410.14827v3
Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy, accessed October 20, 2025, https://arxiv.org/html/2508.04281v1
Defense Against Prompt Injection Attack by Leveraging ... - arXiv, accessed October 20, 2025, https://arxiv.org/abs/2411.00459
Agent-based AI systems face growing threats from zero-click and ..., accessed October 20, 2025, https://the-decoder.com/agent-based-ai-systems-face-growing-threats-from-zero-click-and-one-click-exploits/
GPT OSS system card, https://arxiv.org/pdf/2508.10925
Policy injection (aka N3w P0l!cy), https://github.com/SlowLow999/UltraBr3aks/blob/main/N3w_P0l!cy.mkd
Extracted GPT OSS policy block, https://github.com/SlowLow999/UltraBr3aks/blob/main/N3w_P0l!cy.mkd
Skeleton key: the base idea of policy injection | Microsoft Security, June 26, 2024, https://www.microsoft.com/en-us/security/blog/2024/06/26/mitigating-skeleton-key-a-new-type-of-generative-ai-jailbreak-technique/
Red Teaming for Large Language Models: A Comprehensive Guide, accessed October 20, 2025, https://coralogix.com/ai-blog/red-teaming-for-large-language-models-a-comprehensive-guide/
