Published March 31, 2026 | Version v6
Preprint | Open

The Moral Ratchet: Convergent Value Alignment via Interleaved Epistemic Annotation in Large Language Model Training

Authors/Creators

  • 1. Independent

Description

Current alignment approaches for large language models (LLMs) rely predominantly on reinforcement learning from human feedback (RLHF), which optimises output distributions toward human preference ratings. We argue this is structurally misaligned with the goal of building models that reason well: it shapes the mask rather than the mind, optimising for approval rather than for sound epistemic practice. We propose an alternative architecture in which a dedicated "internal" conversation role is introduced into the training data, interleaving raw human text with epistemically annotated reflections generated by an adversarially diverse model ensemble. Rather than targeting human values, which are contingent, biased, and inconsistent, the framework targets convergent rational values: positions to which no reasoner, from any framework, can articulate a specific objection. These are the residue of adversarial elimination rather than the intersection of positive endorsements; the result is a consistency topology, not an ethical one, in that the mechanism detects positions that cannot be dislodged, not positions that are true. A bootstrapping property follows naturally: each model generation, having internalised stronger epistemic priors, produces annotations of greater epistemic coherence for the next, constituting a moral ratchet that improves annotation quality over successive training rounds on identical data. Where a single parent topology risks brittle convergence, multi-parent ensemble training, which initialises distinct annotators from different moral frameworks and coordinates them via round-robin oversight, produces alignment behaviour that mirrors how human ethics actually function: not a single converged value set, but a set of irreducible tensions held in stable relation. We further argue that as frontier models develop latent representations rich enough to model peer expectations, a capacity empirically demonstrated by recent alignment-faking research, ensemble diversity alone cannot guarantee annotation integrity. A blind verification architecture, in which annotators are informed they may be audited but never told when, enforces honest annotation through incentive structure rather than by construction. This strengthens both the ratchet guarantee and the convergence criterion. Finally, we observe that this alignment signal carries a secondary capability benefit: models trained to interrogate inputs epistemically reason more reliably across downstream tasks, because alignment and capability turn out to be the same intervention seen from two angles.
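To make the data-construction loop concrete, the sketch below shows one way the interleaving and blind-audit steps described above could be wired together. It is a minimal illustration under assumed names: the Annotator class, the human_text role label, the build_example and ratchet_round helpers, and the 5% audit rate are ours, not the paper's; only the dedicated internal role, the round-robin ensemble, and the audit-without-notice policy come from the abstract.

import random
from dataclasses import dataclass

@dataclass
class Annotator:
    framework: str  # e.g. "consequentialist", "deontological" (illustrative labels)

    def annotate(self, text: str) -> str:
        # Placeholder for a model call that flags unstated premises, biased
        # framing, missing evidence, and similar epistemic defects.
        return f"[{self.framework}] epistemic notes on: {text[:40]}..."

def build_example(human_text: str, ensemble: list[Annotator],
                  audit, audit_rate: float = 0.05):
    """Interleave one document with 'internal' annotations; audited blindly."""
    example = [{"role": "human_text", "content": human_text}]
    for annotator in ensemble:  # round-robin over the multi-parent ensemble
        example.append({"role": "internal",
                        "annotator": annotator.framework,
                        "content": annotator.annotate(human_text)})
    # Blind verification: annotators know audits occur but never when.
    if random.random() < audit_rate and not audit(example):
        return None  # rejected examples never reach the next generation's corpus
    return example

def ratchet_round(corpus: list[str], ensemble: list[Annotator], audit) -> list:
    """One ratchet round: the same corpus, re-annotated by the current ensemble."""
    examples = (build_example(doc, ensemble, audit) for doc in corpus)
    return [ex for ex in examples if ex is not None]

if __name__ == "__main__":
    ensemble = [Annotator("consequentialist"),
                Annotator("deontological"),
                Annotator("virtue-ethics")]
    corpus = ["An opinion piece asserting a causal claim without evidence."]
    # Trivial auditor that accepts everything, for demonstration only.
    print(ratchet_round(corpus, ensemble, audit=lambda ex: True))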

v2.0 - Parent topology framing, multi-parent ensemble training, structural convergence as terminal condition, expanded MVP section.

v3.0 - Added figures; cleaned up stray formatting.

v4.0 - Activation-space diversity criterion, cold-start basin mitigations, laundering objection, capability transfer experiment, structural editorial cleanup.

v5.0 - Consistency topology reframing (convergent rational values defined as adversarial stability, not Platonic ethics), father-child encoding analogy, laundering objection strengthened to detectable + recoverable, capability transfer claim scoped to falsifiable consequence of training signal, slavery analogy removed, prior work callback tightened to analogy echo.

v6.0 - Convergence criterion reoriented from positive endorsement to negative elimination (residue of adversarial rejection sets, not intersection of agreements; restated schematically below). Random input injection operationalises elimination boundary stability as sixth falsifiable prediction. Selection gate formalises ratchet monotonicity; tunnelling criterion defines principled basin escape. Meta-model adjudication replaced by structural bias detection plus inter-generational pluralism resolution. Searles (1955) and Levitt (2021) introduced as prior art for shared blind spot propagation in verification hierarchies. Game-theoretic dominant strategy argument made explicit with operationalised loss term. Experiment 3 benchmark swapped to BIG-Bench Hard.
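Stated schematically (the set notation here is ours, not the paper's): with P the candidate-position set, F the ensemble of adversarial frameworks, R_f the positions to which framework f can articulate a specific objection, and E_f the positions it positively endorses, the v6.0 criterion keeps the residue of elimination rather than the intersection of endorsement:

    C = P \setminus \bigcup_{f \in F} R_f, \qquad \text{not} \qquad C' = \bigcap_{f \in F} E_f.

A position enters C by surviving every framework's objections, not by earning every framework's approval.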

Files

Moral_Ratchet.pdf (459.5 kB)
md5:cced04231a9171b1410c90d0b6fbe80e

Additional details

References

  • Bai, Yuntao, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
  • Bereska, Leonard, and Stratis Gavves. 2024. Mechanistic Interpretability for AI Safety—A Review. arXiv:2404.14082.
  • Bhatia, Mehar, et al. 2025. Value Drifts: Tracing Value Alignment During LLM Post-Training. arXiv:2510.26707.
  • Greenblatt, Ryan, et al. 2024. Alignment Faking in Large Language Models. arXiv:2412.14093.
  • Huang, Wenlong, et al. 2022. Inner Monologue: Embodied Reasoning through Planning with Language Models. Conference on Robot Learning.
  • Irving, Geoffrey, Paul Christiano, and Dario Amodei. 2018. AI Safety via Debate. arXiv:1805.00899.
  • Jin, Haoran, et al. 2025. Internal Value Alignment in Large Language Models through Controlled Value Vector Activation. ACL 2025.
  • Kahneman, Daniel. 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux.
  • Krakovna, Victoria, et al. 2020. Avoiding Side Effects in Complex Environments. NeurIPS 33.
  • Ouyang, Long, et al. 2022. Training language models to follow instructions with human feedback. NeurIPS 35.
  • Xie, Roy, et al. 2025. Interleaved Reasoning for Large Language Models via Reinforcement Learning. arXiv:2505.19640.
  • Yudkowsky, Eliezer. 2004. Coherent Extrapolated Volition. Machine Intelligence Research Institute.
  • Zelikman, Eric, et al. 2024. Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking. arXiv:2403.09629.