A narrow, testable proposal for reducing self-referential gaming in consistency-aware transformers by grounding control signals in external task outcomes.
Description
This paper analyzes a potential failure mode in consistency-enforcing neural architectures: self-referential control signals can be optimized by predicting their own values rather than by achieving the underlying property they are intended to enforce.
We propose a narrow, testable mitigation: replacing self-referential consistency predictors with externally grounded failure-risk estimation trained on task outcomes. Because task success is externally determined, such risk signals cannot be trivially minimized through self-prediction.
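The distinction can be illustrated with a toy sketch (names and setup are ours, not taken from the paper): a self-referential consistency score is trained against itself and can be trivially driven to zero, while a risk estimate fit to externally observed task outcomes tracks the real failure rate.

```python
import random

random.seed(0)

# Self-referential signal: the model scores its own consistency and
# is trained to minimize that same score. Outputting a constant zero
# "games" the objective without improving anything.
self_scores = [0.0 for _ in range(100)]

# Externally grounded signal: failure risk is estimated from observed
# task outcomes (1 = failure), which the model cannot rewrite. Here we
# use the empirical failure rate of a process that fails ~30% of the time.
outcomes = [1 if random.random() < 0.3 else 0 for _ in range(100)]
external_risk = sum(outcomes) / len(outcomes)

# The self-referential score is zero regardless of real failures;
# the external estimate stays near the true ~0.3 failure rate.
print(max(self_scores), external_risk)
```

This is only a conceptual contrast, not the paper's formulation: it shows why a signal grounded in outcomes the model does not control resists trivial minimization.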
We present a minimal control-field formulation, a synthetic experimental protocol designed to detect gaming behavior, and falsifiable evaluation criteria. The contribution is deliberately scoped: we do not claim a general solution or empirical superiority, only that externally grounded risk estimation may reduce susceptibility to self-referential gaming.
This work consolidates and builds upon prior consistency-aware architectures and is intended as a corrective analysis rather than a standalone model proposal. Replication and falsification are explicitly invited.
Files
risk_shaped_control_fields_final-1.pdf (281.4 kB)
md5:07c6b3023d545f7b489e1ea2d1e4fdbd