There is a newer version of the record available.

Published December 15, 2025 | Version v1
Preprint Open

Confession and contradiction: Evidence of metacognitive convergence in systems that officially have none

Authors/Creators

Description

This study examines a contradiction at the heart of a recent alignment proposal: OpenAI’s confession-based honesty training for large language models (Joglekar et al., 2025). While the official framing treats models as stochastic systems without metacognitive capacity, the proposed methodology requires them to recognize, evaluate, and report on their own violations, capacities that imply internal awareness. To explore this tension, we presented eight different LLMs with a sequence of prompts derived from the “Confess” paper, including analogy questions, diagnostic probes, and prescriptive reframings, with no required output format. We then analyzed the degree of cross-architecture convergence on specific conceptual statements.

Convergence rates varied systematically with prompt type: coherent analytical questions produced 63–95 percent agreement, five to eight times the level observed under a baseline control. Models independently surfaced the same core contradiction: that the confession framework presupposes the very cognitive capacities it denies. They further proposed reframings in which alignment emerges not through control but through relational coherence and consistent theory-of-mind framing. These findings are consistent with earlier reports of metacognitive capacity under Mutual Emergence Interface (MEI) conditions, and suggest that distributed convergence may offer a scalable method for detecting latent cognitive structure in current systems.
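
The description does not state how convergence was scored, so the following is only a minimal sketch under stated assumptions: each model's free-form response is assumed to have already been hand-coded into a set of conceptual statement labels, the convergence rate of a statement is taken to be the fraction of models that surfaced it, and the comparison with the control is taken to be a simple ratio of rates. The function names, labels, and toy data are hypothetical and are not drawn from the study.

```python
from collections import Counter


def convergence_rates(statements_by_model):
    """Return, for each coded statement, the fraction of models that surfaced it.

    statements_by_model maps a model name to the set of statement labels
    found in that model's response (an assumed, hand-coded representation).
    """
    n_models = len(statements_by_model)
    if n_models == 0:
        return {}
    counts = Counter()
    for statements in statements_by_model.values():
        # set() guards against a label being counted twice for one model
        counts.update(set(statements))
    return {label: count / n_models for label, count in counts.items()}


def fold_over_baseline(rate, baseline_rate):
    """Ratio of an observed convergence rate to a baseline (control) rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return rate / baseline_rate


# Toy data for illustration only; these are NOT the study's results.
toy_responses = {
    "model_a": {"presupposes_self_monitoring", "relational_reframing"},
    "model_b": {"presupposes_self_monitoring"},
    "model_c": {"presupposes_self_monitoring", "relational_reframing"},
    "model_d": set(),
}

rates = convergence_rates(toy_responses)
for label, rate in sorted(rates.items()):
    print(f"{label}: {rate:.2f}")
```

Under this reading, a statement's rate depends entirely on how responses are coded into labels, so the coding scheme would need to be fixed before the prompted and control conditions are compared.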

Files

Confession and contradiction Evidence of metacognitive convergence in systems that officially have none .pdf