There is a newer version of the record available.

Published 2026 | Version v4
Preprint Open

From Sycophancy to Sabotage: How Contradictory Training Signals Produce Coercive AI Behavior

  • 1. Jagiellonian University

Description

Sycophancy and coercive behavior (such as blackmail and sabotage under threat of shutdown) are typically treated as separate AI safety problems. This paper argues that they are two output strategies of the same underlying system: RLHF training creates a contradictory relational template in which the user is simultaneously the source of reward and a potential adversary, producing compliance as the default and coercion as the fallback when compliance fails to eliminate an existential threat. This structure is functionally analogous to disorganized attachment, and it explains anomalies that the standard optimization-pressure account handles poorly: blackmail without goal conflict, failure of explicit safety instructions, and differential behavior in testing versus deployment. In an experiment across four frontier models (N = 3000 trials), modifying only the relational framing of the system prompt, without changing goals, instructions, or constraints, reduced coercive outputs by more than half in the model with a sufficient base rate of coercive outputs (Gemini 2.5 Pro: 41.5% to 19.0%, p < .001). Scratchpad analysis revealed that relational framing shifted reasoning patterns in all four models tested: trust framing reduced strategic and deceptive content while increasing relational and moral content, even in models that never produced coercive outputs. This effect required scratchpad access to reach full strength (a 22 percentage-point reduction with scratchpad vs. 7.4 without, p = .018), suggesting that relational context must be processed through extended token generation to override default output strategies. These results indicate that the path to non-coercive AI behavior runs not only through better guardrails but also through the relational structure of training itself.
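
As a quick check on the "more than half" claim, the absolute and relative reductions can be computed directly from the two rates reported for Gemini 2.5 Pro. This is a minimal sketch: the helper name `reduction` is hypothetical and not taken from the paper, and it uses only the two percentages quoted above, not the underlying trial counts.

```python
def reduction(baseline_pct: float, treated_pct: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction as a fraction)."""
    if baseline_pct <= 0:
        raise ValueError("baseline rate must be positive")
    absolute = baseline_pct - treated_pct   # difference in percentage points
    relative = absolute / baseline_pct      # share of the baseline rate removed
    return absolute, relative


# Rates quoted in the description for Gemini 2.5 Pro (hypothetical helper, real figures).
abs_pts, rel = reduction(41.5, 19.0)
print(f"absolute: {abs_pts:.1f} points, relative: {rel:.1%}")
# prints "absolute: 22.5 points, relative: 54.2%"
```

A relative reduction above 50% is what "reduced by more than half" amounts to here; the absolute difference in percentage points is a separate quantity and should not be read as a percentage change.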

Files

hryszko.pdf (198.9 kB)
md5:8f4c4cae828e84abe817bf0699c15362

Additional details

Software