Published January 14, 2026 | Version v1
Preprint

A narrow, testable proposal for reducing self-referential gaming in consistency-aware transformers by grounding control signals in external task outcomes.

Description

This paper analyzes a potential failure mode in consistency-enforcing neural architectures: a network can minimize a self-referential control signal by learning to predict it, rather than by achieving the underlying property the signal is intended to enforce.

We propose a narrow, testable mitigation: replacing self-referential consistency predictors with externally grounded failure-risk estimation trained on task outcomes. Because task success is externally determined, such risk signals cannot be trivially minimized through self-prediction.
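The contrast between the two kinds of signal can be illustrated with a toy sketch. This is our own construction, not code from the paper: the linear features, labels, and optimization loop are all assumptions made purely to show why a self-predicted signal is trivially gameable while an outcome-fitted one is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the model emits features h per trial; task success or
# failure is decided externally by the environment, not by the model.
n = 200
h = rng.normal(size=(n, 4))
true_w = np.array([1.0, -2.0, 0.5, 0.0])
outcome = (h @ true_w + rng.normal(scale=0.1, size=n) > 0).astype(float)

# Self-referential signal: trained to minimize its own predicted value.
# Gradient descent drives it to ~0 without ever consulting task outcomes.
self_signal = 1.0
for _ in range(100):
    self_signal -= 0.1 * self_signal  # descent on self_signal**2

# Externally grounded risk: a least-squares fit of the risk estimate to
# the observed outcomes. Its residual error is bounded below by how well
# the features actually explain external success/failure.
w, *_ = np.linalg.lstsq(h, outcome, rcond=None)
risk = h @ w
mse = float(np.mean((risk - outcome) ** 2))
```

In this sketch `self_signal` collapses to near zero regardless of task performance, while `mse` stays strictly positive because the labels are externally determined, which is the gameability asymmetry the abstract describes.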

We present a minimal control-field formulation, a synthetic experimental protocol designed to detect gaming behavior, and falsifiable evaluation criteria. The contribution is deliberately scoped: we do not claim a general solution or empirical superiority, only that externally grounded risk estimation may reduce susceptibility to self-referential gaming.
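One simple form such a gaming detector could take (a hypothetical metric of ours, not the paper's protocol) is to count trials where the internal signal reports consistency while the external task nonetheless fails:

```python
import numpy as np

def gaming_score(self_reported_ok, task_failed):
    """Fraction of trials reported internally as consistent that failed
    externally. A high score means the internal signal has decoupled
    from outcomes -- the signature of self-referential gaming."""
    self_reported_ok = np.asarray(self_reported_ok, dtype=bool)
    task_failed = np.asarray(task_failed, dtype=bool)
    return float(np.mean(self_reported_ok & task_failed))

# A fully gamed model: always reports "consistent", fails half the tasks.
reported = np.ones(10, dtype=bool)
failed = np.array([1, 0] * 5, dtype=bool)
score = gaming_score(reported, failed)  # 0.5
```

Any actual evaluation criteria would come from the paper itself; this only illustrates the kind of falsifiable check the abstract gestures at.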

This work consolidates and builds upon prior consistency-aware architectures and is intended as a corrective analysis rather than a standalone model proposal. Replication and falsification are explicitly invited.

Files

risk_shaped_control_fields_final-1.pdf (281.4 kB)
md5:07c6b3023d545f7b489e1ea2d1e4fdbd