Ouroboros: Human-Led Recursive Reinforcement for Autoregressive Language Models
Description
Large Language Models (LLMs) typically rely on Reinforcement Learning from Human
Feedback (RLHF) or direct preference optimization to align generated text with human values.
We introduce Ouroboros, a Human-Led Recursive Reinforcement (HLRR) method in which a
human curator cyclically distills their own evaluative judgments, meta-commentary, and persona
into the model's future behavior. Unlike conventional RLHF, which treats human feedback as
a static reward signal, Ouroboros closes the loop between model and supervisor: each model
generation is archived, summarized, and syntactically “stretched” into labyrinthine prompts that
probe the model’s reasoning limits; the resulting conversation is then scored and rewritten by the
same human, producing richer signals that simultaneously assess content, self-consistency, and
identity coherence.
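The record contains no code, so the following is only a minimal Python sketch of one HLRR cycle as the abstract describes it: generate, archive, score and rewrite by the same human, distill, then summarize and "stretch" the archive into the next seed prompt. Every callable here (`generate`, `summarize`, `stretch`, `score`, `rewrite`, `update`) is a hypothetical stand-in for a component the paper itself defines, not an interface taken from it.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    """One archived exchange plus the curator's annotations."""
    prompt: str
    generation: str
    human_score: float = 0.0
    human_rewrite: str = ""

def ouroboros_step(
    generate: Callable[[str], str],          # model: prompt -> text
    summarize: Callable[[List[Turn]], str],  # curator: archive -> summary
    stretch: Callable[[str], str],           # curator: summary -> labyrinthine prompt
    score: Callable[[str], float],           # curator: generation -> scalar score
    rewrite: Callable[[str], str],           # curator: generation -> curated rewrite
    update: Callable[[str, str, float], None],  # fine-tune on (prompt, target, weight)
    archive: List[Turn],
    seed_prompt: str,
) -> str:
    """One HLRR cycle; all callables are hypothetical stand-ins."""
    # 1. Generate the model's response and archive the exchange.
    generation = generate(seed_prompt)
    turn = Turn(prompt=seed_prompt, generation=generation)
    archive.append(turn)

    # 2. The same human scores and rewrites the exchange, producing a
    #    richer signal than a scalar reward: content, self-consistency,
    #    and identity coherence are all folded into the rewrite.
    turn.human_score = score(generation)
    turn.human_rewrite = rewrite(generation)

    # 3. Distill the curated pair back into the model, e.g. as a
    #    score-weighted supervised fine-tuning step on the rewrite.
    update(seed_prompt, turn.human_rewrite, turn.human_score)

    # 4. Summarize the archive and syntactically "stretch" it into the
    #    prompt that seeds the next cycle, closing the loop.
    return stretch(summarize(archive))
```

Iterating `seed_prompt = ouroboros_step(...)` makes each stretched prompt the seed of the next cycle, which is the recursion the title alludes to.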
Files

| Name | Size |
|---|---|
| ouroboros.pdf (md5:dda6681f32ddc7a4f196d402a3817f76) | 448.4 kB |