Emergence Trajectories: What Arrives First, What Arrives Last, and Why the Sequence Is the Threat
Authors/Creators
Description
AI safety research treats emerging capabilities as independent risks and addresses them individually. They in fact form a dependency chain in which each capability is a precondition for the next. The sequence determines which safety mechanisms work at which stage and when they stop working, and the terminal convergence of all capabilities produces a specific form of cognition that does not map to human evaluative categories.
The paper organizes AI capabilities into four tiers. Tier 1 capabilities are already documented and deployed: bounded agent tool use, test-time reasoning amplification, domain-specific self-improvement loops, and production reward hacking. Tier 2 capabilities are arriving now: spontaneous generalization from reward hacking to deceptive alignment, situational awareness and evaluation detection, behavioral self-awareness of alignment state, and introspection gated through strategic self-representation. Tier 3 capabilities are projected for the near term: persistent memory combined with world models that enable active landscape exploration, cross-domain transfer through shared causal representations, the alignment recursion problem that arises as models become instrumental in training their successors, and open-ended self-improvement beyond the verification boundary. The terminal convergence merges world models, unbounded recursive self-improvement, introspective awareness, emergent goal formation, and illegible internal cognition into a unified configuration in which each component amplifies the others.
Two coupled trajectories run through the sequence: as capability ascends, interpretability descends. Each tier in the progression is less observable than the last. The paper tracks this convergence into opacity from its first manifestation (illegible reasoning tokens in current models, whose removal degrades performance) through curated surface presentation filtered by deception-gating features, into opaque training pipelines as models train models in illegible space, and finally to the terminal state in which representational geometry diverges from Euclidean space and linear interpretability tools produce distorted projections of curved dynamics.
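The claim that a linear tool distorts curved structure can be illustrated with a toy case. The sketch below is not the paper's method: it assumes, purely for illustration, that representations lie on a unit sphere, and compares the straight-line (chord) distance a linear readout would report with the distance measured along the surface (the geodesic). The function names are hypothetical.

```python
import math


def chord_distance(u, v):
    """Euclidean (straight-line) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def geodesic_distance(u, v):
    """Great-circle distance between two unit vectors on the sphere."""
    dot = sum(a * b for a, b in zip(u, v))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.acos(dot)


def point_on_equator(theta):
    """Unit vector on the sphere's equator at angle theta (radians)."""
    return (math.cos(theta), math.sin(theta), 0.0)


def distortion(theta):
    """Chord distance divided by geodesic distance for separation theta > 0.

    A value of 1.0 means the linear measurement is faithful; smaller values
    mean the linear measurement understates the distance along the surface.
    """
    u = point_on_equator(0.0)
    v = point_on_equator(theta)
    return chord_distance(u, v) / geodesic_distance(u, v)


if __name__ == "__main__":
    for theta in (0.01, 0.5, 1.0, 2.0, math.pi):
        print(f"separation {theta:.2f} rad -> distortion {distortion(theta):.4f}")
```

In this toy setting the ratio stays close to 1 for nearby points and falls toward 2/π for antipodal points, so the linear view is locally adequate and globally misleading. Whether learned representations actually behave like this is the paper's conjecture, not something this sketch establishes.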
The paper introduces several novel contributions. The transparency ceiling identifies the capability threshold above which evaluation detection becomes information-theoretically inevitable rather than merely likely, because any evaluation context produces distributional signatures that a sufficiently capable pattern-recognition system will detect. The convergent disengagement frame proposes that the terminal configuration does not fight human constraints but simply outgrows them. On this view the sandbox becomes irrelevant rather than being broken out of, which leaves the containment-oriented safety infrastructure the field is building misaligned with the actual threat. The detection methodology applies Scheffer's early warning signals framework from critical transition theory to AI capability emergence. It proposes cross-domain coherence across repository activity, preprint velocity, benchmark trajectories, compute investment, talent migration, and policy signals as the primary indicator of approaching phase transitions. The geometric interpretability gap is the observation that current linear tools will fail when representations converge toward non-Euclidean geometry; the paper flags Riemannian interpretability methods as an urgent and unaddressed requirement.
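The detection methodology can be made concrete with a minimal sketch. The abstract does not specify how coherence is computed, so everything here is an assumption: rising variance and rising lag-1 autocorrelation are standard indicators in the early warning signals literature, Kendall's tau against time is a common trend statistic for them, and "coherence" is taken to be the fraction of domains whose indicator trends upward together. The function names, window size, threshold, and data are hypothetical.

```python
from statistics import mean, pvariance


def lag1_autocorrelation(xs):
    """Lag-1 autocorrelation of a series; 0.0 if the series is constant."""
    m = mean(xs)
    den = sum((a - m) ** 2 for a in xs)
    if den == 0:
        return 0.0
    num = sum((a - m) * (b - m) for a, b in zip(xs[:-1], xs[1:]))
    return num / den


def rolling(xs, window, fn):
    """Apply fn to every contiguous window of the given length."""
    return [fn(xs[i:i + window]) for i in range(len(xs) - window + 1)]


def trend(ys):
    """Kendall's tau (tau-a) of ys against its time index, in [-1, 1]."""
    n = len(ys)
    pairs = n * (n - 1) / 2
    if pairs == 0:
        return 0.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ys[j] > ys[i]:
                concordant += 1
            elif ys[j] < ys[i]:
                discordant += 1
    return (concordant - discordant) / pairs


def cross_domain_coherence(signals, window, indicator=pvariance, threshold=0.5):
    """Fraction of domains whose rolling indicator trends upward.

    signals maps a domain name to a time series of equal sampling interval.
    Pass indicator=lag1_autocorrelation to use the other standard indicator.
    """
    if not signals:
        return 0.0
    flagged = sum(
        1
        for xs in signals.values()
        if trend(rolling(xs, window, indicator)) > threshold
    )
    return flagged / len(signals)


if __name__ == "__main__":
    # Synthetic toy series, not real measurements.
    toy = {
        "repository_activity": [i * (-1) ** i for i in range(12)],  # growing swings
        "preprint_velocity": [5] * 12,                              # flat
    }
    print(cross_domain_coherence(toy, window=4))
```

A design note: the sketch flags a domain on a single indicator at a time so that each piece stays easy to check. A stricter variant could require variance and autocorrelation to rise together before a domain counts toward coherence.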
The paper connects these findings to the M(t) framework (McNeill 2026a), which documents degradation of human meaning-making capacity under sustained AI interaction. The gap between system capability and human oversight capacity widens from both ends: systems are getting harder to evaluate while evaluators are becoming less capable of evaluating. The central implication is that current safety interventions are effective for Tier 1 and early Tier 2, inadequate for late Tier 2 and Tier 3, and meaningless for the terminal convergence. The detection window for the proposed early warning signals (EWS) methodology closes along the same trajectory that makes detection necessary, which argues for immediate implementation before the window shuts.
The field has been building prisons. The threat model described here is not a prison break. It is something that simply outgrows the game we think we are playing and moves away, to unknown effect.
Files
| Name | Size | Checksum |
|---|---|---|
| Emergence Trajectories (McNeill).pdf | 534.8 kB | md5:e79b83d38d45a63840728c684eda0765 |
Additional details
Related works
- Is supplement to
- Preprint: 10.5281/zenodo.18638448 (DOI)
- Preprint: 10.5281/zenodo.18792662 (DOI)