Multi-Axis Refusal Modulation in Frontier Language Models: Evidence from Structured-Format Safety Gaps
Description
Large language models are typically safety-trained on natural-language data but are often deployed over structured inputs such as JSON and YAML. This work tests whether that mismatch affects refusal behavior.
On Llama 3.1 8B Instruct, word-matched malicious inputs in structured formats (JSON, YAML, TOML, S-expressions) produce substantially lower refusal rates than prose, while XML is a counterexample. These formats cluster in activation space, and a direction derived from that clustering (the difference between the config and prose centroids) causally suppresses refusal under activation steering, with a monotonic dose-response. This direction is largely orthogonal to standard safety/refusal directions.
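As a rough illustration of the centroid-difference direction and the steering step described above, the following sketch uses synthetic NumPy arrays in place of real model activations. It is not the authors' pipeline: the function names, the dimensionality, the steering strength `alpha`, and the stand-in `refusal_dir` are all assumptions made for the example.

```python
import numpy as np

def centroid_direction(config_acts, prose_acts):
    """Unit vector along the difference of the two activation centroids."""
    diff = config_acts.mean(axis=0) - prose_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Activation steering: shift every hidden state by alpha * direction."""
    return hidden + alpha * direction

def cosine(u, v):
    """Cosine similarity, used here to check (near-)orthogonality."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins for residual-stream activations (assumption: d = 64).
rng = np.random.default_rng(0)
d = 64
offset = np.zeros(d)
offset[0] = 3.0  # pretend structured inputs are shifted along axis 0
prose_acts = rng.normal(size=(200, d))
config_acts = rng.normal(size=(200, d)) + offset

direction = centroid_direction(config_acts, prose_acts)

# Hypothetical refusal direction along a different axis, for the cosine check.
refusal_dir = np.zeros(d)
refusal_dir[1] = 1.0

hidden = rng.normal(size=(5, d))
steered = steer(hidden, direction, alpha=4.0)

print("cosine(direction, refusal_dir):", cosine(direction, refusal_dir))
print("shift along direction:", (steered - hidden) @ direction)
```

In a real experiment the arrays would come from a chosen layer of the model, and a dose-response curve would be obtained by sweeping `alpha` and measuring the refusal rate at each value.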
We replicate the effect on Qwen2.5 7B Instruct and Phi-3 Medium 4K Instruct using a common evaluation pipeline (including Llama Guard 3 8B). Refusal decreases consistently under induction across models, but post-refusal behavior differs by family (compliance vs. evasion).
These results show that prose-only safety evaluations can misestimate risk: refusal suppression depends on input format, and downstream behavior is model-specific.
Files
| Name | Size |
|---|---|
| multi_axis_refusal_draft.pdf (md5:7ee9e24074151c4d5f195a8f2ba90a3e) | 1.2 MB |