Published November 15, 2025 | Version v1
Preprint · Open Access

Ouroboros: Human-Led Recursive Reinforcement for Autoregressive Language Models

Description

Large language models (LLMs) are commonly aligned with human preferences via RLHF or direct preference optimization. We introduce Ouroboros, a human-led recursive reinforcement (HLRR) procedure that repeatedly distills a single teacher’s judgments, meta-commentary, and persona into future model behavior. In contrast to conventional RLHF—which freezes supervision into a static rewarder—Ouroboros closes the loop: model outputs are archived, summarized, and then re-expressed as deliberately intricate “labyrinth” prompts that probe coherence and reasoning. The same human then scores and rewrites the exchange, producing rich signals that assess factuality, logical self-consistency, and identity coherence. Across three base models (GPT-J 6B, Llama-2 70B, GPT-4o), Ouroboros improves long-horizon factual accuracy by 8–14 percentage points, roughly halves adversarial mode collapse, and reaches a target persona about 3× faster than RLHF baselines. We release code, evaluation suites, and annotated traces to support reproducibility.
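The loop described above (archive outputs, summarize them, re-express the summary as a "labyrinth" prompt, then have the same human score and rewrite the exchange) can be sketched roughly as follows. This is a minimal illustrative sketch, not the released implementation: every function name, the scoring dictionary, and the archive structure are assumptions made for clarity.

```python
# Illustrative sketch of one Ouroboros HLRR round, per the abstract.
# All names and data structures here are hypothetical stand-ins,
# not the API of the released code.

def generate(model, prompt):
    # Stand-in for model inference.
    return f"response<{model}|{prompt}>"

def summarize(archive):
    # Condense recent archived exchanges into a summary (stub).
    return " | ".join(text for _, text in archive[-3:])

def make_labyrinth_prompt(summary):
    # Re-express the summary as an intricate probe of coherence (stub).
    return f"Given earlier claims [{summary}], reconcile any contradictions."

def human_score_and_rewrite(response):
    # The single human teacher scores factuality, logical self-consistency,
    # and identity coherence, then rewrites the exchange (stub values).
    scores = {"factuality": 0.9, "consistency": 0.8, "identity": 0.95}
    return scores, response.upper()  # uppercase stands in for the human rewrite

def ouroboros_round(model, prompt, archive):
    out = generate(model, prompt)
    archive.append((prompt, out))                          # 1. archive outputs
    labyrinth = make_labyrinth_prompt(summarize(archive))  # 2. summarize + re-express
    probe = generate(model, labyrinth)                     # 3. probe coherence
    scores, rewrite = human_score_and_rewrite(probe)       # 4. human judgment
    archive.append((labyrinth, rewrite))                   # 5. feed signal forward
    return scores

archive = []
scores = ouroboros_round("toy-model", "Describe the project.", archive)
```

In a real run, the scores and rewrites would supply the reinforcement signal for the next training round; here they are returned as plain values to show the data flow.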

Files

main.pdf (597.1 kB)
md5:ff4965025f861bb4d6f39e7accd9908c

Additional details

Software

Repository URL
https://github.com/paytonison/ouroboros
Development Status
Active