Geometric Token Transport: Riemannian Parallel Transport as Sequential State Accumulation in Neural Sequence Processing

Davis, Bee Rosa

doi:10.5281/zenodo.18883346

Published March 5, 2026 | Version v9

Preprint Open

Geometric Token Transport: Riemannian Parallel Transport as Sequential State Accumulation in Neural Sequence Processing

Davis, Bee Rosa (Researcher)

We present Marcella, a Holonomic Sequence Model that replaces the attention mechanism with parallel transport on a learned Riemannian manifold. At 15.7M parameters on WikiText-103, Marcella V7.2 achieves PPL 50.91 — a 41.2% improvement over V6Gain (PPL 86.5) from just 128 additional parameters implementing a three-signal proprioceptive gain gate. The full architecture achieves PPL 37.11 versus a vanilla transformer's 146.24 — a 3.94× advantage — with O(T log T) complexity instead of O(T²). No query-key-value projections. No softmax bottleneck. The geometry is the computation.

This work emerges from a unified geometric framework that has resolved multiple foundational problems across mathematics and physics, including the Yang-Mills mass gap, Navier-Stokes regularity, and the Poincaré conjecture via independent gauge-theoretic proofs. The Marcella architecture is the computational instantiation of the Davis field equations.

The Problem with Attention

The self-attention mechanism computes α_ij = softmax(qᵢᵀ kⱼ / √d_k), a pairwise similarity score between every query and every key. This has three fundamental pathologies:

(1) Quadratic complexity. The N × N attention matrix costs O(N² d) FLOPs per layer. At sequence length T=4096 with d=128, this is 2.1 billion multiply-adds per layer — most of which produce near-zero weights that contribute nothing to the output.

(2) Representational flatness. The query-key inner product qᵀk operates in Euclidean ℝᵈ. The distance d(q,k) = ‖q − k‖₂ satisfies the triangle inequality with equality — there is no curvature, no path-dependence, no notion that traveling q → k → m might accumulate geometric information that q → m does not. Every composition of attention steps is equivalent to a single weighted average. The architecture cannot represent holonomy: the phenomenon where parallel transporting a vector around a closed loop returns it rotated. This is precisely the information that encodes sequential structure — the difference between "dog bites man" and "man bites dog" requires path-dependent computation that flat inner products cannot express without external positional encodings.

(3) The softmax bottleneck. Softmax normalization forces Σⱼ αᵢⱼ = 1, creating a zero-sum competition among keys. Increasing attention to one position necessarily decreases attention to all others. This is an information-theoretic constraint: the channel capacity of a single attention head is bounded by log(N) bits per query position, regardless of the richness of the key representations. Multi-head attention partially mitigates this (H heads give H·log(N) bits), but the fundamental bottleneck — that each head must choose a convex combination — remains.

The Geometric Solution

Marcella replaces all three components:

(1) Quadratic → O(log T). The recurrence hₜ = Rₜ hₜ₋₁ + xₜ, with Rₜ ∈ SO(d), admits a parallel prefix scan because the operator (R_b, x_b) ⊕ (R_a, x_a) = (R_b R_a, R_b x_a + x_b) is the group law of the Euclidean group E(d) = SO(d) ⋉ ℝᵈ. Associativity gives O(log T) parallel depth. Total cost: O(T d² log T) for the matrix products plus O(T R d²) for the connection — both linear in T for fixed d, R.

(2) Flat → Curved. A learned low-rank connection Γⁱⱼₖ = Σᵣ Uⁱᵣ Sᵣ Vʲᵣ Wᵏᵣ contracts with the tangent vector δᵏ to produce the connection 1-form ωⁱⱼ = Γⁱⱼₖ δᵏ. Antisymmetrization A = M − Mᵀ extracts the 𝔰𝔬(d) component. The Cayley-Neumann transform R = I − 2A + 2A² − 2A³ maps A ∈ 𝔰𝔬(d) to R ∈ SO(d). The per-step rotation is input-dependent and position-dependent — the model learns a connection, not a fixed set of weights. The holonomy R_acc = Rₜ Rₜ₋₁ ⋯ R₁ is path-dependent because SO(d) is non-abelian for d ≥ 3: Rₐ Rᵦ ≠ Rᵦ Rₐ in general. Word order is encoded by the geometry itself without positional embeddings.

(3) Softmax → Sphere projection. Post-scan normalization ĥ = h_raw / ‖h_raw‖ projects to S^(d−1). No zero-sum competition. No convex combination constraint. The approximation error is controlled: ‖h_norm − h_Riem‖ = O(‖xₜ‖² / ‖hₜ_raw‖²) per step. As the connection strengthens during training (κ growing, K_eff falling), the denominator grows and the approximation tightens — the model converges toward true Riemannian transport automatically.

The Riemannian-Parallel Incompatibility Theorem

We prove that true geodesic transport Exp_p(v) on any manifold M with nonzero sectional curvature is incompatible with the parallel prefix scan. The geodesic flow maps φᵥ: M → M do not form a finite-dimensional group under composition: φᵥ ∘ φw depends on the basepoint p through the curvature tensor. The partial result φ_{v_k} ∘ ⋯ ∘ φ_{v_1} is an element of Diff(M) — an infinite-dimensional group admitting no finite scan representation. This yields a fundamental tradeoff:

• Flat + scannable: recurrence in E(d), O(log T) depth, states leave S^(d−1) • Curved + sequential: geodesic exponential on M, O(T) depth, states stay on S^(d−1) • Approximately curved + scannable: flat scan + projection, O(log T) depth, error O(‖x‖²/‖h‖²)

This is the same structural obstruction as the abelian obstruction in arithmetic Cayley graphs (Davis, 2026): CRT factorizability enables O(log T) computation but forces flatness, exactly as abelian character sums enable per-factor independence but force spectral depletion in the Ramanujan bound. The normalize-after-scan construction is the neural-network analog of LPS quaternion generators: computationally tractable elements of a scannable group with a projection step that achieves manifold boundedness.

The Holonomy Subgroup

The per-step rotations Rₜ = Cayley(Aₜ) lie in a rank-R subspace of SO(d). The generated Lie subalgebra 𝔥 = span{Uᵣ Vₛᵀ − Vₛ Uᵣᵀ : 1 ≤ r,s ≤ R} has dimension at most R(R−1)/2. At R=12, d=128: dim(𝔥) ≤ 66, while dim(𝔰𝔬(128)) = 8,128 — a 123× compression. This is the cuspidal sector: the subspace where genuinely new geometric information (path-dependent, order-sensitive, non-decomposable) is encoded. The holonomy detects sequence structure because Rₜ ⋯ R₁ ≠ R_{σ(T)} ⋯ R_{σ(1)} for generic permutations σ.

Scale Validation: WikiText-103

15.7M parameters, d=128, T=256, connection rank R=12, lr=3×10⁻⁴, A100-SXM4-40GB:

• Marcella v6: PPL 37.11 at 20,000 steps, κ=2.45, K_eff=0.943 • Vanilla pre-norm transformer (15.9M params, 12 layers, 4 heads): PPL 146.24 at 20,000 steps • Advantage ratio: 3.94×, consistent from step 500 onward with no crossover • Two-seed reproducibility: R1 (no warmup) = 37.11, R2 (geo_gate warmup) = 37.15, ΔPPL = 0.034

The vanilla transformer received 230,846 more parameters than Marcella (15,909,248 vs 15,678,402) — the comparison is conservative; Marcella wins despite the parameter handicap. Identical tokenizer (GPT-2 BPE, 50,257 tokens), optimizer (AdamW), scheduler (cosine), batch size (128 effective), and hardware.

Geometric diagnostics confirm the theoretical predictions: κ (rotation magnitude) reached 2.45, corresponding to ~29° per-plane rotation in each of R=12 planes; K_eff (effective curvature ratio) peaked at 1.60 (step 700, input-dominated regime) and crossed below 1.0 at step 11,200 (final: 0.943, rotation-dominated equilibrium). The model exhibited 13 regime transitions in the flickering zone near the crossover boundary — a phase transition analogous to the sonic onset phenomenon in Bose-Einstein condensates.

The Γ.detach() ablation (surgically disconnecting curvature from training) causes a 60% performance drop, proving the geometry is doing real, irreplaceable work.

Learned Connection Results (Tiny Shakespeare)

Five independent seeds, parameter-matched (236,240 vs 236,238), 20 epochs:

• Marcella R8 (learned connection): PPL 1.22 ± 0.02 • Marcella FD (Levi-Civita finite-difference): PPL 1.49 ± 0.06 • Vanilla transformer: PPL 8.94 ± 0.03

The Levi-Civita constraint (metric compatibility + torsion-free) is revealed as an information bottleneck. The learned low-rank connection accesses connections with nonzero torsion Tⁱⱼₖ = Γⁱⱼₖ − Γⁱₖⱼ and metric incompatibility ∇ₖ gᵢⱼ ≠ 0. It finds strictly better optima at all five seeds with lower variance (σ = 0.02 vs 0.06). The finite-difference computation, designed as the "exact" method, is the ceiling, not the floor.

Why Orthogonal Initialization Is the Key

Every eigenvalue of Rₜ ∈ SO(d) satisfies |λ| = 1. From step 0, gradient flow is perfect: no explosion (|λ| > 1), no vanishing (|λ| < 1). The vanilla transformer must learn routing through random attention weights before it can learn content. Marcella begins doing useful geometric mixing from the first forward pass — orthogonal initialization is not a trick, it is a theorem: the spectral radius of the Jacobian ∂hₜ/∂hₛ equals 1 for all t > s.

Complexity: O(T log T) total vs O(T²) for attention. No attention heads, no KV cache, no causal mask. At T=4096: attention costs ~134M FLOPs per layer; Marcella costs ~8.4M — a 16× reduction.

New in v7.0: Torus Cache. High-dimensional spheres suffer from concentration of measure: all pairs of points become approximately equidistant (Lévy's concentration). The torus T^k = (S¹)^k does not concentrate. A learned projection π_T: S^(d−1) → T^k via W_torus ∈ ℝ^(2k×d) maps hidden states to torus coordinates; soft attention over torus distances provides O(1) targeted access to geometrically similar positions regardless of sequence distance. This dual-torus structure echoes the asymmetric subsystem architecture discovered in SHA-256 cryptographic analysis. At k=2 with hash buckets, complexity drops from O(T²) to O(T) — the production path for long sequences.

New in v7.2: Geometric Proprioception. The model now monitors its own geometric state via three proprioceptive signals fed through a learned gain gate:

• κₜ (rotation intensity): Frobenius norm of the skew-symmetric generator ‖Aₜ‖_F. Measures how hard the geometry is working at position t. • departureₜ (manifold drift): hₜᵀ(Rₜhₜ₋₁) / ‖hₜ‖‖Rₜhₜ₋₁‖. Measures deviation from the tangent plane. • 𝒜ₜ (coherence): 1 − ‖R_window − I‖_F / max. Measures whether rotations over the last w=8 positions cancel (high coherence) or compound without cancellation (low coherence). This is the windowed holonomy — a non-local signal distinguishing coherent difficulty (U-turn) from turbulent difficulty (fishtailing).

The gain gate hₜᵒᵘᵗ = hₜ ⊙ (1 + W_g[κ, dep, 𝒜]ᵀ) with W_g ∈ ℝ^(128×3) zero-initialized adds 384 parameters. Each column learns the per-dimension response to one sensation.

V7.2 Results (WikiText-103, 10,000 steps):

Step	V7.2 PPL	V6Gain PPL	Δ	‖W_g[:,κ]‖	‖W_g[:,dep]‖	‖W_g[:,coh]‖
100	798	—	—	0.22	0.20	0.16
500	268	—	—	0.52	0.53	0.32
1000	161	—	—	0.65	0.68	0.25
5000	64.7	86.5	−25.2%	1.43	1.33	0.19
10000	50.91	86.5	−41.2%	1.99	1.59	0.15

The coherence column exhibits a scaffold-then-residual pattern: it activates early (0.16→0.32) when local signals are noisy, transfers weight to κ/departure as they calibrate, then stabilizes at 0.15 (not zero) — the model retains it for cases only coherence can detect. This is the first architecture that monitors its own geometric state for uncertainty quantification, rather than observing output distributions.

Related Theoretical Foundations. The Davis framework underlying Marcella has produced results across: • Yang-Mills mass gap: Information-geometric proof via topological charge incompressibility • Navier-Stokes regularity: Holonomy-first proof via field equation integrability • Poincaré conjecture: Independent gauge-theoretic proof via Davis-Wilson flow • P ≠ NP: Geometric separation of curved energy landscapes • Landau sonic onset: Universal critical velocity for vortex nucleation in BECs

Implementation: PyTorch. GitHub: https://github.com/nurdymuny/geotorch

Keywords: Holonomic Sequence Model, Riemannian geometry, parallel transport, sequence modeling, language modeling, geometric deep learning, orthogonal RNN, parallel scan, manifold learning, Cayley transform, holonomy, spectral theory, attention-free transformer, SO(d), Euclidean group, State Space Model, SSM, torus cache, Davis manifold, gauge theory, geometric proprioception, gain gate, windowed holonomy, coherence signal

Author: Bee Rosa Davis — bee_davis@alumni.brown.edu | ORCID: 0009-0009-8034-4308

Files

geodesic_computation_v7_1.pdf

Files (1.3 MB)

Name	Size	Download all
geodesic_computation_v7_1.pdf md5:783a2398e0d49a151060666e5d22feee	1.3 MB	Preview Download

	All versions	This version
Views	2,146	1,505
Downloads	1,897	1,247
Data volume	2.4 GB	1.8 GB

Geometric Token Transport: Riemannian Parallel Transport as Sequential State Accumulation in Neural Sequence Processing

Authors/Creators

Description

The Problem with Attention

The Geometric Solution

The Riemannian-Parallel Incompatibility Theorem

The Holonomy Subgroup

Scale Validation: WikiText-103

Learned Connection Results (Tiny Shakespeare)

Why Orthogonal Initialization Is the Key

Files

geodesic_computation_v7_1.pdf

Files (1.3 MB)