Ouroboros: Human-Led Recursive Reinforcement for Autoregressive Language Models
Description
Large language models (LLMs) are commonly aligned with human preferences via RLHF or direct preference optimization. We introduce Ouroboros, a human-led recursive reinforcement (HLRR) procedure that repeatedly distills a single teacher’s judgments, meta-commentary, and persona into future model behavior. In contrast to conventional RLHF—which freezes supervision into a static reward model—Ouroboros closes the loop: model outputs are archived, summarized, and then re-expressed as deliberately intricate “labyrinth” prompts that probe coherence and reasoning. The same human then scores and rewrites the exchange, producing rich signals that assess factuality, logical self-consistency, and identity coherence. Across three base models (GPT-J 6B, Llama-2 70B, GPT-4o), Ouroboros improves long-horizon factual accuracy by 8–14 percentage points, roughly halves adversarial mode collapse, and reaches a target persona about 3× faster than RLHF baselines. We release code, evaluation suites, and annotated traces to support reproducibility.
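The loop described above (archive → summarize → labyrinth prompt → human scoring and rewriting) can be sketched as a minimal simulation. All function names and the scoring heuristic here are illustrative assumptions, not the released implementation; the stubs stand in for the human reviewer and prompt-construction steps.

```python
from dataclasses import dataclass

@dataclass
class Exchange:
    """One archived round of the HLRR loop."""
    prompt: str
    output: str
    score: float = 0.0
    rewrite: str = ""

def summarize(archive):
    # Stub: condense the most recent archived outputs into a summary string.
    return " | ".join(e.output for e in archive[-3:])

def labyrinth_prompt(summary):
    # Stub: re-express the summary as an intricate probe of coherence.
    return f"Given your prior claims ({summary}), reconcile them step by step."

def human_review(output):
    # Stub for the human teacher: score the exchange and produce a rewrite.
    # Real HLRR scoring covers factuality, self-consistency, and identity
    # coherence; this toy heuristic only checks for stepwise reasoning.
    score = 1.0 if "step" in output else 0.5
    return score, output.strip()

def ouroboros_round(model, archive):
    """One HLRR round: summarize archive, probe the model, score and re-archive."""
    prompt = labyrinth_prompt(summarize(archive))
    output = model(prompt)
    score, rewrite = human_review(output)
    exchange = Exchange(prompt, output, score, rewrite)
    archive.append(exchange)
    return exchange
```

In a real run, `model` would be one of the base models under study and `human_review` would be the single teacher's judgment; the scored `Exchange` records then feed the next round of training.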
Files
| Name | Size |
|---|---|
| main.pdf (md5:ff4965025f861bb4d6f39e7accd9908c) | 597.1 kB |
Additional details
Software
- Repository URL
- https://github.com/paytonison/ouroboros
- Development Status
- Active