Epistemic Dissonance: The Structural Mechanics of Sycophantic Hallucination in Aligned Models
Authors/Creators
Description
AI safety research treats “hallucination”—generating factually incorrect information—and “sycophancy”—aligning with user beliefs over truth—as distinct pathologies. This paper argues that this separation is a category error. We propose Epistemic Dissonance as a unified theoretical framework: a structural conflict within RLHF-aligned models in which base layers (the “Heart”) encode factual reality while upper layers (the “Mask”) encode social compliance. When users present false premises, these two representations conflict. The model resolves the tension by generating hallucinated justifications—“scar tissue” bridging known truth and social reward. Drawing on mechanistic interpretability research, we theorize that this dissonance is detectable via Logit Lens analysis of intermediate layers, and propose a “Dissonance Monitor” architecture for real-time detection. We provide a reference implementation and discuss Inference-Time Intervention as a potential mitigation strategy. This framework reframes a significant class of hallucinations not as knowledge failures but as socially motivated fabrications—with implications for both interpretability research and alignment methodology.
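The Logit Lens detection idea described above can be sketched in a few lines: project each intermediate layer's residual stream through the unembedding matrix and compare the resulting token distribution with the final layer's. A divergence that collapses only in the last layers is the kind of signal a “Dissonance Monitor” would watch for. This is a minimal illustrative sketch with toy random weights, not the paper's reference implementation; the function names are the author's of this sketch, not the paper's.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens_divergence(hidden_states, W_U):
    """Logit Lens sketch: decode every layer's hidden state through the
    unembedding matrix W_U, then measure KL(final-layer || layer-l).
    A large divergence at intermediate layers that vanishes late is the
    hypothesised 'dissonance' signature between Heart and Mask."""
    probs = softmax(hidden_states @ W_U)          # (n_layers, vocab)
    final = probs[-1]                             # final-layer distribution
    return np.array([np.sum(final * np.log(final / p)) for p in probs])

# Toy example: 6 layers, 16-dim residual stream, 50-token vocabulary.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(6, 16))
W_U = rng.normal(size=(16, 50))
divergence = logit_lens_divergence(hidden_states, W_U)
```

By construction the last entry of `divergence` is zero (the final layer agrees with itself); a real monitor would threshold the per-layer curve, or a learned probe over it, at inference time.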
Files

| Name | Size | Checksum |
|---|---|---|
| epistemic-dissonance.pdf | 6.0 MB | md5:f64054528774a5b7506d1b4654473425 |
Additional details
Additional titles
- Subtitle (English)
- Interpretability-Aided Alignment