Published May 20, 2026 | Version 3 (corrected)
Preprint Open

Predicting How Transformers Attend: Analytic Power-Law Theory, Phase Transitions, and Practical Compression Tools

Authors/Creators

  • 1. Independent Research Association

Description

Companion (Part II): the follow-up paper Predicting How Transformers Attend, Part II (https://doi.org/10.5281/zenodo.19960573) extends this work with a six-axis γ-decomposition (including the learned-imprint axis ν=−1/(2π)), an NF4 precision-sensitivity rule, the Cardy entropy anomaly, a bimodal Hagedorn phase structure, and a Sage+Lean machine-verified algebraic backbone (15 D-SAGE identities).

Version 3 (corrected, 2026-05-20): Hagedorn heat-capacity coefficient corrected to C_V(γ=1, N) = (log N)^2/12 in Thm. 7.1 (previously /4). Added prior-art citation to Qu, Ly & Gong, "Fractional neural attention" (arXiv:2511.10208, 2025). Section 21 (TAF Agent) expanded to document all 22 browser modes (v0.4-v0.7 diagnostic and anti-bullshit packs, Sage+Lean machine-verification layer); landing-page figure updated. Spanish edition updated in parallel.

  This work gives a first-principles explanation of the ubiquitous power-law decay of attention weights in transformer
  LLMs. The RoPE positional encoding imposes a log-distance constraint on the attention score; the maximum-entropy
  distribution compatible with that constraint is a power law A(d) ∝ d^(-γ) with closed-form exponent

    γ = (2θ - T_eval √2) / (2θ + T_eval √2),

  which is the [1,1] Padé approximant of e^(-z), namely (2 - z)/(2 + z), evaluated at z = √2 T_eval / θ. The prediction
  is validated on 30+ models from Pythia-70M to Qwen2.5-7B, with median MAE 4.3% (n=9 non-anomalous subset, n=56 full
  panel) on the geometric centroid; corpus, architecture, and induction-head phase contribute the residual variance
  via a five-axis decomposition (R²=0.44 on n=23).
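As a rough illustration of the closed-form exponent above, the sketch below evaluates γ and compares it with e^(-z) at z = √2·T_eval/θ, the function that the [1,1] Padé form (2 - z)/(2 + z) approximates. It assumes θ is the RoPE base and T_eval is an evaluation context length in tokens; the abstract does not define either symbol, and the function names and the value T_EVAL = 2048 are hypothetical choices made only for this example.

```python
import math


def gamma_pade(theta: float, t_eval: float) -> float:
    """Closed-form exponent from the abstract: (2θ - T√2) / (2θ + T√2)."""
    root2_t = math.sqrt(2.0) * t_eval
    return (2.0 * theta - root2_t) / (2.0 * theta + root2_t)


def gamma_exponential(theta: float, t_eval: float) -> float:
    """e^(-z) at z = √2·T/θ, the function the [1,1] Padé form approximates."""
    return math.exp(-math.sqrt(2.0) * t_eval / theta)


# Assumption: T_EVAL is an illustrative context length, not a value from the paper.
T_EVAL = 2048

# The three θ values match the controlled-θ pilot mentioned in the abstract.
for theta in (1e4, 1e5, 1e6):
    pade = gamma_pade(theta, T_EVAL)
    exact = gamma_exponential(theta, T_EVAL)
    print(f"theta={theta:.0e}  gamma_pade={pade:.5f}  exp(-z)={exact:.5f}")
```

Under these assumptions γ approaches 1 as θ grows relative to T_eval, and the Padé form and e^(-z) agree closely whenever z is small.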

  Three operational consequences follow: (i) a regime diagram (γ<1, γ=1, γ>1) classifying long-context use; (ii) a
  closed-form KV-cache compression window D_f predicting the operating point that empirical methods (SnapKV,
  PyramidKV, BLASST) calibrate by sweep; and (iii) a closed-form NTK base scaling α_opt for zero-shot context
  extension, which is Pareto-dominant on n=4 Pythia models against the unscaled baseline; that baseline collapses to
  chance retrieval at L > T_train.
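To make the three γ regimes concrete, the sketch below computes how much of the total attention weight falls within a local window when A(d) ∝ d^(-γ). This is only an illustration of why the value of γ matters for windowed KV-cache retention; it is not the paper's D_f formula, which the abstract does not state, and the function name, window size, and context length are hypothetical.

```python
def attention_mass_within(window: int, context: int, gamma: float) -> float:
    """Fraction of total weight at distances 1..window when A(d) ∝ d^(-gamma)
    over distances 1..context."""
    if not 1 <= window <= context:
        raise ValueError("window must satisfy 1 <= window <= context")
    total = sum(d ** -gamma for d in range(1, context + 1))
    near = sum(d ** -gamma for d in range(1, window + 1))
    return near / total


# Assumption: a 64-token window inside a 4096-token context, chosen for illustration.
for gamma in (0.5, 1.0, 2.0):
    share = attention_mass_within(64, 4096, gamma)
    print(f"gamma={gamma:.1f}  mass within 64 of 4096 tokens: {share:.3f}")
```

For γ>1 the sum of d^(-γ) converges, so a short window already holds most of the weight; for γ≤1 it diverges with context length, so distant tokens keep a substantial share.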

  A controlled-θ pretraining pilot at θ ∈ {10⁴, 10⁵, 10⁶} confirms quantitative agreement (max 5.07% relative error vs
  Padé) under causal isolation. Higher-order predictions are also validated empirically: a power law fits better than
  an exponential in 54 of 56 measurements, and the per-layer γ is stable (CV<0.20) on 5/5 models.
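One common way to compare a power law with an exponential is to fit a straight line to log A against log d (power law) and to log A against d (exponential), then compare the residuals. The sketch below does that with ordinary least squares and also computes a coefficient of variation for per-layer exponents. It is an assumed procedure run on synthetic data, not the authors' measurement pipeline, and every function name and sample value in it is hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def compare_decay_models(distances, weights):
    """Fit log A vs log d (power law) and log A vs d (exponential); compare residuals."""
    log_w = [math.log(w) for w in weights]
    log_d = [math.log(d) for d in distances]
    power_slope, _, power_sse = linear_fit(log_d, log_w)
    exp_slope, _, exp_sse = linear_fit(list(distances), log_w)
    return {
        "gamma": -power_slope,          # power-law exponent estimate
        "decay_rate": -exp_slope,       # exponential rate estimate
        "power_sse": power_sse,
        "exp_sse": exp_sse,
        "power_law_wins": power_sse < exp_sse,
    }


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / abs(mean)


# Synthetic, noise-free weights with gamma = 0.75 (illustrative, not measured data).
distances = list(range(1, 257))
weights = [d ** -0.75 for d in distances]
result = compare_decay_models(distances, weights)
print(result["gamma"], result["power_law_wins"])

# Hypothetical per-layer gamma estimates, to show the CV < 0.20 stability check.
print(coefficient_of_variation([0.70, 0.75, 0.80]) < 0.20)
```

On real attention maps the weights would first be averaged per distance and zero entries dropped before taking logarithms; those preprocessing choices are left out here.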

  A free, browser-based diagnostic tool implementing every formula is available at
  https://karlesmarin.github.io/tafagent (Apache-2.0). Source and reproducibility data (343 JSON measurement files,
  5.5 MB) are at https://github.com/karlesmarin/tafagent.

  Single-sentence position: "Attention is not learned arbitrarily; it follows a constrained scaling law that can be
  exploited for design, efficiency, and reasoning."

Notes

  • Companion paper (Part II): Predicting How Transformers Attend, Part II: A Six-Axis Decomposition. Zenodo. https://doi.org/10.5281/zenodo.19960573
  • Companion dataset: TAF Attention-Decay Measurements. HuggingFace Datasets. https://huggingface.co/datasets/karlexmarin/taf-attention-decay
  • Companion diagnostic tool: TAF Agent, a browser-based diagnostic tool. HuggingFace Spaces. https://huggingface.co/spaces/karlexmarin/taf-agent

Files

Predicting How Transformers Attend.pdf

Files (4.7 MB)

md5:90a4ac6fb3fa69d18f8c1110da9646bc (2.1 MB)
md5:ed4fbc3e6e663640a4db701c2aef358f (473.5 kB)
md5:6c0c0462375d63c9d1e525cae58683ed (2.1 MB)

Additional details

Software

Repository URL
https://karlesmarin.github.io/tafagent
Programming language
Python