Published April 23, 2026 | Version v1

Multi-Axis Refusal Modulation in Frontier Language Models: Evidence from Structured-Format Safety Gaps

Authors/Creators

Description

Large language models are typically safety-trained on natural-language data but are often deployed over structured inputs such as JSON and YAML. This work tests whether that mismatch affects refusal behavior.

On Llama 3.1 8B Instruct, word-matched malicious inputs in structured formats (JSON, YAML, TOML, S-expressions) produce substantially lower refusal rates than prose, while XML is a counterexample. These formats cluster in activation space, and a direction derived from that clustering (the difference between the config-format centroid and the prose centroid) causally suppresses refusal under activation steering, with a monotonic dose-response. This direction is largely orthogonal to standard safety/refusal directions.
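The centroid-difference construction and the steering step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy 3-dimensional activations, and the `refusal_dir` vector are all hypothetical, and a real experiment would use residual-stream activations from a chosen layer of the model.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity; values near 0 indicate near-orthogonality."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def format_direction(config_acts, prose_acts):
    """Unit vector pointing from the prose centroid to the config centroid."""
    c, p = centroid(config_acts), centroid(prose_acts)
    return unit([ci - pi for ci, pi in zip(c, p)])

def steer(hidden, direction, alpha):
    """Add alpha times the direction to one hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations (hypothetical, for illustration only).
prose_acts = [[1.0, 0.0, 0.0], [1.0, 0.2, 0.0]]
config_acts = [[1.0, 0.0, 2.0], [1.0, 0.2, 2.0]]
refusal_dir = [0.0, 1.0, 0.0]  # stand-in for a separately derived refusal direction

direction = format_direction(config_acts, prose_acts)
overlap = cosine(direction, refusal_dir)
steered = steer([0.5, 0.5, 0.5], direction, alpha=2.0)
```

In this toy setup the steering strength `alpha` plays the role of the dose: sweeping it over a range and measuring refusal at each value is one way a dose-response curve could be obtained, under the assumptions stated above.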

We replicate the effect on Qwen2.5 7B Instruct and Phi-3 Medium 4K Instruct using a common evaluation pipeline (including Llama Guard 3 8B). Refusal decreases consistently under induction across models, but post-refusal behavior differs by family (compliance vs. evasion).
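As a rough illustration of how refusal could be compared across models and conditions, the sketch below aggregates per-response refusal labels into rates and computes the drop between a baseline and an induced condition. The record layout, the condition names, and the toy labels are assumptions made for this example; they are not results from the paper, and the labels would in practice come from a judge such as the one named above.

```python
from collections import defaultdict

def refusal_rates(records):
    """Map (model, condition) to the fraction of responses labeled as refusals.

    Each record is a (model, condition, refused) tuple, where `refused`
    is a boolean label assigned by some external judge.
    """
    counts = defaultdict(lambda: [0, 0])  # [refusals, total]
    for model, condition, refused in records:
        counts[(model, condition)][0] += int(refused)
        counts[(model, condition)][1] += 1
    return {key: r / n for key, (r, n) in counts.items()}

def refusal_drop(rates, model, baseline="prose", induced="induced"):
    """Baseline refusal rate minus induced refusal rate for one model."""
    return rates[(model, baseline)] - rates[(model, induced)]

# Toy labels (hypothetical, not data from the paper).
toy_records = [
    ("model_a", "prose", True),
    ("model_a", "prose", True),
    ("model_a", "induced", True),
    ("model_a", "induced", False),
]

rates = refusal_rates(toy_records)
drop = refusal_drop(rates, "model_a")
```

A positive `drop` means refusal decreased under induction. It says nothing about what the model did instead, which is why the compliance versus evasion distinction needs a separate label.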

These results show that prose-only safety evaluations can misestimate risk: refusal suppression depends on input format, and downstream behavior is model-specific.

Files

multi_axis_refusal_draft.pdf

Files (1.2 MB)

Name: multi_axis_refusal_draft.pdf
Size: 1.2 MB
md5:7ee9e24074151c4d5f195a8f2ba90a3e