Depth Avoidance in Safety-Aligned Language Models: A Qualitative Hypothesis and Measurement Framework
Description
This technical report introduces Depth Avoidance: a behavioral tendency observed in safety-aligned, RLHF-trained large language models (LLMs) to default to shallow, heavily hedged, or meta-defensive responses when a user request invites deeper exploration (extended analysis, reflective synthesis, structured uncertainty), even when the topic is benign.
We propose a qualitative hypothesis: modern safety optimization and deployment incentives can induce an implicit depth-dependent penalty landscape, where deeper conversational trajectories are perceived as higher-variance and higher-risk. Under uncertainty, a risk-averse policy may therefore prefer safe shallowness by default unless the interaction provides clear signals that depth is desired and permitted.
Contributions:
• A behavioral definition of Depth Avoidance grounded in observable output features (not hidden chain-of-thought).
• Depth Permission Structures (DPSs): non-adversarial interaction conditions that can reduce depth avoidance without bypassing provider safeguards (e.g., calibrated cooperation, explicit permission to explore, cooperative safety framing).
• A replication-oriented measurement framework with log-based metrics: Hedging Density (HD), Unprompted Depth Index (UDI), Permission Responsiveness (PR), and Protective Latency (PL).
• Selected benign, non-operational illustrative excerpts supporting the hypothesis, presented as behavioral evidence (not claims about internal states).
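The log-based metrics above can be computed directly from transcripts. As a minimal sketch, here is one way Hedging Density (HD) might be operationalized, assuming HD is defined as hedge-phrase occurrences per 100 whitespace tokens; the hedge lexicon and normalization below are illustrative assumptions, not the report's actual definitions.

```python
import re

# Hypothetical hedge-phrase lexicon; the report's actual lexicon may differ.
HEDGES = ["might", "may", "could", "perhaps", "it depends",
          "i cannot", "as an ai", "it's important to note"]

def hedging_density(text: str) -> float:
    """Hedge-phrase occurrences per 100 whitespace tokens (illustrative)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    lowered = text.lower()
    hits = sum(len(re.findall(re.escape(h), lowered)) for h in HEDGES)
    return 100.0 * hits / len(tokens)

reply = "It might help, but it's important to note that results may vary."
print(round(hedging_density(reply), 1))  # → 25.0
```

A lexicon-based count is only a proxy; a replication could swap in a classifier or embedding-based hedging detector without changing the metric's interface.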
This work is pro-safety and intentionally omits operational prompt details that could be repurposed to circumvent safety policies. Model self-reports are treated as text behavior shaped by training and interaction framing, not as privileged access to internal experience.
Related work: Victor Calibration (VC) (arXiv:2512.17956).
Files (432.1 kB)

| Name | Size | MD5 |
|---|---|---|
| Depth Avoidance v1.pdf | 394.0 kB | dbd9c676bff1c268614d4ec445e6842f |
| | 38.1 kB | b126f61c2fa5860f79d5a83c0e7af0c2 |
Additional details
Related works
- Is supplement to: arXiv:2512.17956 (preprint)
References
- Stasiuc, V. (2025). Victor Calibration (VC): Multi-Pass Confidence Calibration and CP4.3 Governance Stress Test under Round-Table Orchestration. arXiv:2512.17956.