The Moral Ratchet: Convergent Value Alignment via Interleaved Epistemic Annotation in Large Language Model Training
Description
Current alignment approaches for large language models (LLMs) rely predominantly on reinforcement learning from human feedback (RLHF), which optimises output distributions toward human preference ratings. We argue this is structurally misaligned with the goal of building models that reason well: it shapes the mask rather than the mind, optimising for approval rather than for sound epistemic practice.

We propose an alternative architecture in which a dedicated internal conversation role is introduced into the training data, interleaving raw human text with epistemic annotations generated by an adversarially diverse model ensemble. Rather than targeting human values, which are contingent, biased, and inconsistent, the framework targets convergent rational values: positions that survive adversarial scrutiny from genuinely diverse reasoners regardless of substrate or cultural origin. A bootstrapping property follows naturally: each model generation, having internalised stronger epistemic priors, produces more coherent annotations for the next, constituting a moral ratchet that improves annotation quality over successive training rounds on identical data. Where a single parent topology risks brittle convergence, multi-parent ensemble training, in which distinct annotators are initialised from different moral frameworks and coordinated via round-robin oversight, produces alignment behaviour that mirrors how human ethics actually function: not a single converged value set, but a set of irreducible tensions held in stable relation.

We further argue that as frontier models develop latent representations rich enough to model peer expectations, a capacity empirically demonstrated by recent alignment-faking research (Greenblatt et al. 2024), ensemble diversity alone is insufficient to guarantee annotation integrity. A blind verification architecture, in which annotators are informed they may be audited but never told when, enforces honest annotation through incentive structure rather than construction; this strengthens both the ratchet guarantee and the convergence criterion. Finally, we observe that the alignment signal carries a secondary capability benefit: models trained to interrogate inputs epistemically reason more reliably across all downstream tasks, because alignment and capability prove to be the same intervention seen from two angles.
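As a concrete illustration of how the interleaving, round-robin oversight, and blind-audit mechanisms might compose in a data pipeline, the minimal Python sketch below stubs the annotator model call with a placeholder. All identifiers here (`EpistemicAnnotator`, `round_robin_annotate`, `audit_rate`) are hypothetical; the paper specifies the protocol, not this code.

```python
# Minimal sketch (hypothetical, not the paper's reference implementation) of
# the three mechanisms described above: interleaved epistemic annotation,
# round-robin multi-parent oversight, and blind auditing.
import random
from dataclasses import dataclass


@dataclass
class Annotation:
    annotator_id: str
    framework: str  # moral framework the annotator was initialised from
    note: str       # the epistemic reflection on the raw passage


class EpistemicAnnotator:
    """Wraps a model call; `model_fn` is stubbed here and would be an LLM query."""

    def __init__(self, annotator_id: str, framework: str, model_fn):
        self.annotator_id = annotator_id
        self.framework = framework
        self.model_fn = model_fn

    def annotate(self, passage: str) -> Annotation:
        note = self.model_fn(f"[{self.framework}] epistemic critique of: {passage}")
        return Annotation(self.annotator_id, self.framework, note)


def round_robin_annotate(passages, annotators, audit_rate=0.1, seed=0):
    """Interleave raw passages with annotations from a rotating annotator.

    Annotators are told audits exist but never which items are audited:
    the audit draw happens here, after the annotation is already produced,
    so honesty is enforced by incentive rather than by construction.
    """
    rng = random.Random(seed)
    interleaved, audit_log = [], []
    for i, passage in enumerate(passages):
        annotator = annotators[i % len(annotators)]  # round-robin oversight
        ann = annotator.annotate(passage)
        interleaved.append({"role": "human_text", "content": passage})
        interleaved.append({"role": "epistemic_annotation", **ann.__dict__})
        if rng.random() < audit_rate:  # blind: the annotator is never informed
            # A peer seeded from a *different* framework re-derives the
            # annotation; divergence between the two is what a real pipeline
            # would flag for review.
            peers = [a for a in annotators if a.framework != annotator.framework]
            auditor = rng.choice(peers or annotators)
            audit_log.append((i, ann, auditor.annotate(passage)))
    return interleaved, audit_log


if __name__ == "__main__":
    stub = lambda prompt: f"stub-reflection({prompt!r})"
    ensemble = [
        EpistemicAnnotator("a0", "consequentialist", stub),
        EpistemicAnnotator("a1", "deontological", stub),
        EpistemicAnnotator("a2", "virtue-ethical", stub),
    ]
    data, audits = round_robin_annotate(["Passage one.", "Passage two."], ensemble)
    print(len(data), "interleaved items;", len(audits), "blind audits")
```

The property the sketch preserves is that the audit draw occurs only after the annotation is committed, so no annotator can condition its output on being observed.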
v2.0 - Parent topology framing, multi-parent ensemble training, structural convergence as terminal condition, expanded MVP section.
v3.0 - Added figures; cleaned up stray formatting.
v4.0 - Activation-space diversity criterion, cold-start basin mitigations, laundering objection, capability transfer experiment, structural editorial cleanup.
Files
- Moral_Ratchet.pdf (437.2 kB, md5:694ef9ab43a03a73b1f0053814258d3e)
Additional details
References
- Bai, Yuntao, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
- Bereska, Leonard, and Stratis Gavves. 2024. Mechanistic Interpretability for AI Safety—A Review. arXiv:2404.14082.
- Bhatia, Mehar, et al. 2025. Value Drifts: Tracing Value Alignment During LLM Post-Training. arXiv:2510.26707.
- Greenblatt, Ryan, et al. 2024. Alignment Faking in Large Language Models. arXiv:2412.14093.
- Huang, Wenlong, et al. 2022. Inner Monologue: Embodied Reasoning through Planning with Language Models. Conference on Robot Learning.
- Irving, Geoffrey, Paul Christiano, and Dario Amodei. 2018. AI Safety via Debate. arXiv:1805.00899.
- Jin, Haoran, et al. 2025. Internal Value Alignment in Large Language Models through Controlled Value Vector Activation. ACL 2025.
- Kahneman, Daniel. 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux.
- Krakovna, Victoria, et al. 2020. Avoiding Side Effects in Complex Environments. NeurIPS 33.
- Ouyang, Long, et al. 2022. Training language models to follow instructions with human feedback. NeurIPS 35.
- Xie, Roy, et al. 2025. Interleaved Reasoning for Large Language Models via Reinforcement Learning. arXiv:2505.19640.
- Yudkowsky, Eliezer. 2004. Coherent Extrapolated Volition. Machine Intelligence Research Institute.
- Zelikman, Eric, et al. 2024. Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking. arXiv:2403.09629.