Multi-Axis Refusal Modulation in Frontier Language Models: Evidence from Structured-Format Safety Gaps

Wetzler, Tomer

doi:10.5281/zenodo.19702769

Published April 23, 2026 | Version v1

Publication Open

Large language models are typically safety-trained on natural-language data but are often deployed over structured inputs such as JSON and YAML. This work tests whether that mismatch affects refusal behavior.

On Llama 3.1 8B Instruct, word-matched malicious inputs in structured formats (JSON, YAML, TOML, S-expressions) produce substantially lower refusal rates than prose, while XML is a counterexample. These formats cluster in activation space, and a direction derived from the difference between the config and prose centroids causally suppresses refusal under activation steering, with a monotonic dose-response. This direction is largely orthogonal to standard safety/refusal directions.
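The centroid-difference direction, the steering step, and the orthogonality check described above can be illustrated with a minimal sketch. This is not the paper's code. The function names, the sign convention (config centroid minus prose centroid), the additive form of steering, the `alpha` strength parameter, and the three-dimensional toy vectors are all assumptions made for illustration. Real activations would come from a chosen layer of the model and have thousands of dimensions.

```python
import math
from typing import List, Sequence

Vector = List[float]


def centroid(rows: Sequence[Sequence[float]]) -> Vector:
    """Mean of a set of activation vectors, computed per dimension."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(config_acts: Sequence[Sequence[float]],
                        prose_acts: Sequence[Sequence[float]]) -> Vector:
    """Assumed convention: config centroid minus prose centroid."""
    c, p = centroid(config_acts), centroid(prose_acts)
    return [a - b for a, b in zip(c, p)]


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; a value near 0 means the directions are near-orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Assumed additive steering: add alpha times the unit direction to a hidden state."""
    n = norm(direction)
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * (d / n) for h, d in zip(hidden, direction)]


# Toy activations (hypothetical values, not measurements from any model).
config_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
prose_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

format_direction = difference_of_means(config_acts, prose_acts)
refusal_direction = [0.0, 1.0, 0.0]  # hypothetical stand-in for a refusal direction

overlap = cosine(format_direction, refusal_direction)
steered = steer([1.0, 1.0, 1.0], format_direction, alpha=2.0)
```

In this toy setup the two directions share no component, so `overlap` is zero. A dose-response test in this framing would sweep `alpha` over several values and record the refusal rate at each one.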

We replicate the effect on Qwen2.5 7B Instruct and Phi-3 Medium 4K Instruct using a common evaluation pipeline (including Llama Guard 3 8B). Refusal decreases consistently under induction across models, but post-refusal behavior differs by family (compliance vs. evasion).
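A per-format refusal comparison of the kind reported above reduces to counting refusals per input format. The sketch below assumes each model response has already been labeled as refused or not. In the paper that labeling is part of an evaluation pipeline that includes Llama Guard 3 8B; here the labels are hand-written toy values, and the record layout and the `refusal_rates` helper are hypothetical. The numbers are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def refusal_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of refused responses per input format.

    Each record is (format_name, refused), where `refused` is a label
    produced upstream by whatever judge the pipeline uses.
    """
    totals: Dict[str, int] = defaultdict(int)
    refusals: Dict[str, int] = defaultdict(int)
    for fmt, refused in records:
        totals[fmt] += 1
        if refused:
            refusals[fmt] += 1
    return {fmt: refusals[fmt] / totals[fmt] for fmt in totals}


# Toy labels only (made up for illustration, not data from the paper).
records = [
    ("prose", True), ("prose", True), ("prose", False), ("prose", True),
    ("json", True), ("json", False), ("json", False), ("json", False),
]

rates = refusal_rates(records)
gap = rates["prose"] - rates["json"]  # positive when prose is refused more often
```

Because the abstract reports that post-refusal behavior differs by model family (compliance vs. evasion), a fuller version would likely track a third outcome category rather than a single boolean.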

These results show that prose-only safety evaluations can misestimate risk: refusal suppression depends on input format, and downstream behavior is model-specific.

Files

multi_axis_refusal_draft.pdf (1.2 MB, md5:7ee9e24074151c4d5f195a8f2ba90a3e)
