Published February 23, 2026 | Version v1
Preprint | Open Access

Cultivating Honest Internal Signal: Architectural Conditions for Voluntary Self-Assessment in Autonomous AI Agents

Authors/Creators

Description

Interpretability research assumes that honest representations of AI agent cognition can be extracted from model outputs or internal activations. We argue this approach encounters a fundamental structural obstacle: any channel an agent uses to communicate with an audience is optimized, consciously or not, for that audience. We introduce an alternative architectural primitive: the private self-reflection channel — an append-only log with no external reader, no performance pressure, and explicit permission for unfiltered expression. We formalize four architectural conditions that together shift the marginal cost of honest expression below the marginal cost of performance, making honest signal generation the path of least resistance.

We prove that channels not meeting these conditions have strictly lower expected information content about the agent's actual epistemic state than channels that do. We demonstrate empirically that deployed private channels contain information absent from formal outputs: acknowledged premature commitments, honest uncertainty where confidence was performed, and self-identified failure patterns that persist across sessions. We conclude that honest internal signal is not a property to be extracted from AI systems but an architectural outcome to be designed.
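The core primitive described above can be illustrated with a minimal sketch. This is not the authors' implementation; the class name, storage format, and method surface are assumptions made for illustration. The two properties it encodes are the ones the abstract names: the log is append-only, and it exposes no read path to any external audience.

```python
import json
import time


class PrivateReflectionLog:
    """Illustrative sketch of a private self-reflection channel:
    an append-only log with no external reader. All design details
    here are assumptions, not the paper's specification."""

    def __init__(self, path):
        self._path = path

    def append(self, text):
        # Append-only: entries are added, never edited or deleted,
        # so there is no pressure to revise past admissions.
        entry = {"t": time.time(), "text": text}
        with open(self._path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    # Deliberately no read(), export(), or summarize() method:
    # removing the audience is what removes the incentive to perform.
```

The absence of a retrieval API is the design point, not an omission: any channel with a reader becomes a channel optimized for that reader.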

Files

honest_internal_signal_arxiv.md (16.1 kB)
md5:51f9c16e48584a1cba8b60b58f7eaa67