Published June 2, 2026 | Version 2 | Preprint | Open
DHP is a Recurrence Constraint: Full-Attention Transformers Evade the Dynamical Horizon Principle
Description
v2 (2026-06-02): Spatial analog negative result + Cosmos3 physical domain probe. RTM rotation requirement (θ≥87°) elevated to central finding. Abstract prose restructured: bridging sentence for trade-off framing, causal chain built for RTM rotation before Paper 22 unification. Cosmos3-Nano arch description corrected. Neuron structural audit passed.
The Dynamical Horizon Principle (DHP) is a universal constraint observed across diverse recurrent architectures (LSTMs, RWKV-7, CTMs, and noisy quantum recurrent circuits), enforcing a strict relation between task length T and the memory decay timescale τ: T_conv/τ ≈ 0.72. In this work, we demonstrate that DHP is not a general property of gradient descent, but specifically a recurrence constraint. We evaluate sequence parity (temporal XOR) across three distinct regimes:

1. Recurrent Models (LSTM): Exhibit exponential training complexity growth from T=2 through T=10, culminating in optimization failure under standard training budgets (3k–10k steps) at T ≥ 12 (DHP cliff); convergence is recovered at extended budget (30k steps), confirming an exponential time barrier rather than a topological impossibility. The effective memory timescale implied by this cliff is τ_eff = T_cliff/0.72 ≈ 16.7 steps.
2. Full-Attention Transformers: Structurally evade Markovian decay. Training convergence time remains flat (≈140 steps) across all tested sequence lengths T ≤ 48, achieving 100% convergence across all seeds.
3. Window-Attention Transformers: Exhibit a binary receptive-field visibility cliff at exactly T = 2W. For W=16, convergence is immediate below T=32 (receptive field boundary 2W-1 = 31), but drops instantly to 0% at T ≥ 32. For W=32, the target remains within the 2W-1 = 63 receptive field for all tested lengths (T ≤ 48), achieving 100% convergence.

We formalize the mathematics of this division: recurrence forces multiplicative gradient decay through time, while self-attention constructs a direct routing topology that bypasses recurrence decay. Window attention replaces the gradient-decay cliff with a hard visibility boundary. We conclude that DHP represents the boundary of information flow through Markovian recurrences, which attention-based models structurally circumvent.
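The arithmetic quoted in the description can be checked with a short sketch. It is not taken from the paper's code: the function names are hypothetical, and the rule that a sequence converges only when its length fits inside the 2W-1 receptive field is an assumption read off the reported cliff at T = 2W.

```python
# Illustrative check of the numbers quoted in the description.
# All names here are hypothetical; none come from the paper's code.

DHP_RATIO = 0.72  # T_conv / tau, as quoted in the description


def effective_timescale(t_cliff: int, ratio: float = DHP_RATIO) -> float:
    """Memory timescale implied by a cliff at length t_cliff: tau_eff = T_cliff / ratio."""
    return t_cliff / ratio


def receptive_field(window: int) -> int:
    """Receptive field boundary 2W - 1 quoted for window size W."""
    return 2 * window - 1


def target_visible(seq_len: int, window: int) -> bool:
    """Assumed visibility rule: the sequence must fit inside the receptive field."""
    return seq_len <= receptive_field(window)


if __name__ == "__main__":
    # LSTM cliff at T = 12 gives tau_eff of roughly 16.7 steps.
    print(f"tau_eff = {effective_timescale(12):.1f} steps")
    # W = 16: visible up to T = 31, lost from T = 32 = 2W onward.
    for t in (31, 32, 48):
        print(f"W=16, T={t}: visible={target_visible(t, 16)}")
    # W = 32: receptive field 63 covers every tested length T <= 48.
    print(f"W=32, T=48: visible={target_visible(48, 32)}")
```

Under these assumptions the sketch reproduces the boundaries stated above: τ_eff ≈ 16.7 for a cliff at T = 12, a visibility loss at T = 32 for W = 16, and full coverage of T ≤ 48 for W = 32.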
Files
| Name | Size |
|---|---|
| paper28_v2_FINAL.pdf (md5:6a25ec8f3119e4026e4a2f1e6a719f44) | 1.1 MB |