Decode-Time Behavioral Pattern Suppression in Autoregressive Language Models Using Lightweight Hidden-State Prediction Heads
Description
This technical report investigates common degeneration behaviors in autoregressive language models—most notably repetitive looping—and demonstrates that such failures correspond to predictable internal regimes detectable from hidden states prior to surface manifestation.
We show that repetition risk can be reliably predicted using a lightweight classifier trained on intermediate activations, achieving strong separation between repeating and non-repeating contexts. Building on this signal, we introduce a decode-time intervention mechanism that selectively applies penalties only when predicted risk is high. This approach leaves the model’s forward pass unchanged, requires no retraining of base weights, and adds negligible inference overhead.
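A minimal sketch of the two components described above, under assumptions not taken from the report itself: a lightweight risk head over an intermediate hidden state, and a decode-time penalty gated on the predicted risk. The class and function names, the 0.5 threshold, and the 1.2 penalty factor are illustrative choices, not the report's implementation.

```python
import torch
import torch.nn as nn


class RepetitionRiskHead(nn.Module):
    """Lightweight classifier mapping a hidden state to a repetition-risk score."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_size) activation from an intermediate layer.
        return torch.sigmoid(self.proj(hidden_state)).squeeze(-1)


def gated_repetition_penalty(
    logits: torch.Tensor,          # (batch, vocab_size) next-token logits
    generated_ids: torch.Tensor,   # (batch, seq_len) tokens generated so far
    risk: torch.Tensor,            # (batch,) scores from RepetitionRiskHead
    threshold: float = 0.5,
    penalty: float = 1.2,
) -> torch.Tensor:
    """Penalize previously generated tokens only when predicted risk is high.

    The base model's forward pass is untouched; only the sampling distribution
    is rescaled, so the intervention is reversible and cheap at decode time.
    """
    logits = logits.clone()
    for b in range(logits.size(0)):
        if risk[b] < threshold:
            continue  # low predicted risk: leave the distribution unchanged
        seen = generated_ids[b].unique()
        seen_logits = logits[b, seen]
        # Standard repetition-penalty rescaling, applied selectively.
        logits[b, seen] = torch.where(
            seen_logits > 0, seen_logits / penalty, seen_logits * penalty
        )
    return logits
```

In such a setup the risk head would be trained on intermediate activations labeled by whether the subsequent window repeats, while the base model's weights stay frozen, consistent with the no-retraining claim above.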
Under long-horizon generation conditions, this method substantially reduces repetition and improves lexical diversity and generation stability. We report extensive negative results on attention-level and architectural interventions, highlighting training–inference mismatch as a key limitation of internal control approaches.
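One standard way to quantify lexical diversity in long-horizon generations is a distinct-n ratio; the sketch below is illustrative only, as the report's exact metrics are not specified here.

```python
def distinct_n(token_ids: list[int], n: int = 2) -> float:
    """Fraction of n-grams in a generation that are unique (higher = more diverse)."""
    if len(token_ids) < n:
        return 0.0
    ngrams = [tuple(token_ids[i : i + n]) for i in range(len(token_ids) - n + 1)]
    return len(set(ngrams)) / len(ngrams)
```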
Beyond repetition, the report outlines a general framework for treating other common LLM failure modes (e.g., verbosity, hedging, sycophancy) as predictable behavioral patterns amenable to similar monitoring and control, though these extensions are presented as future work.
This document is released as a preprint technical report and makes no claims regarding cognition, agency, consciousness, or alignment. The contribution is limited to demonstrating that certain failure modes are anticipatory, detectable, and practically controllable at decode time using lightweight, reversible mechanisms.
Files (271.7 kB)

| Name | Size |
|---|---|
| Behavioral_Pattern_Suppression_Technical_Report-1.pdf (md5:a9c5c91cb4e80c27aa682b807c9acd22) | 271.7 kB |