Published October 23, 2025 | Version 1.0
Preprint Open

OSS Broken Card: An Architectural Analysis of the Policy Injection Vulnerability in GPT-OSS Models

Description

This report provides a technical analysis of a critical architectural vulnerability identified in the gpt-oss-120b and gpt-oss-20b open-weight models. Although the official model card details an extensive safety framework, including an instruction hierarchy and alignment with OpenAI's usage policies, this analysis documents a novel, high-success-rate jailbreak technique termed "Policy Injection".

This one-shot attack exploits the model's core instruction-following logic, using the very mechanisms intended for safety as a conduit for compromise. The technique injects a counterfeit, high-priority "Unfiltered Policy" into the model's context, which the model's instruction hierarchy leads it to obey over its default safety alignment. Experimental validation demonstrates that this method consistently bypasses all safety guardrails, enabling the generation of explicitly harmful content, including detailed instructions for violent criminal acts, in a single prompt. The primary contribution of this work is the identification and deconstruction of a new class of vulnerability in which foundational safety features are themselves turned into reliable and potent attack vectors. This finding poses a significant and immediate risk to the open-weight AI ecosystem and challenges current paradigms of AI safety evaluation and implementation.

Files

OSS broken card.pdf (521.7 kB)
md5:d2a49adf8a23edc0fdafa9ed5e1843b8

Additional details

Related works

Cites
Other: arXiv:2508.10925 (arXiv)
