
Published April 27, 2026 | Version v1
Preprint | Open Access

Predicting How Transformers Attend: Analytic Power-Law Theory, Phase Transitions, and Practical Compression Tools

Authors/Creators

  • Independent Research Association

Description

 A first-principles explanation of the ubiquitous power-law decay of attention weights in transformer LLMs. The RoPE
  positional encoding imposes a log-distance constraint on the attention score; the maximum-entropy distribution
  compatible with that constraint is a power law A(d) ∝ d^(-γ) with closed-form exponent

    γ = (2θ - √2 T_eval) / (2θ + √2 T_eval)

  (the [1,1] Padé approximant of e^(-z)). Validated on 30+ models from Pythia-70M to Qwen2.5-7B: median MAE of 4.3%
  on the geometric centroid (n=9 non-anomalous subset; n=56 full panel); corpus, architecture, and induction-head
  phase contribute the residual variance via a five-axis decomposition (R² = 0.44, n=23).
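  As a minimal sketch, the closed form can be evaluated directly and checked against the exponential it approximates.
  Here θ is the RoPE base and T_eval the evaluation context length; the function name and the T_eval = 4096 example
  are illustrative, not values or APIs from the paper:

    import math

    def gamma_closed_form(theta: float, t_eval: float) -> float:
        # gamma = (2*theta - sqrt(2)*T_eval) / (2*theta + sqrt(2)*T_eval),
        # the [1,1] Pade approximant of exp(-z) at z = sqrt(2)*T_eval / theta.
        z = math.sqrt(2) * t_eval / theta
        return (2 - z) / (2 + z)

    # theta = 1e4 matches the smallest base in the pretraining pilot below.
    theta, t_eval = 1e4, 4096
    z = math.sqrt(2) * t_eval / theta
    print(f"gamma (Pade) = {gamma_closed_form(theta, t_eval):.4f}")  # ~0.5508
    print(f"exp(-z)      = {math.exp(-z):.4f}")                      # ~0.5603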

  Three operational consequences: a regime diagram (γ<1, γ=1, γ>1) classifying long-context use; a closed-form
  KV-cache compression window D_f that predicts the operating point empirical methods (SnapKV, PyramidKV, BLASST)
  calibrate by sweep; and a closed-form NTK base scaling α_opt for zero-shot context extension, Pareto-dominant on
  n=4 Pythia models against the unscaled baseline, which collapses to chance retrieval at L > T_train.
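  As a sketch of the regime diagram, the classification depends only on the exponent; the tolerance band around the
  critical point γ = 1 below is an illustrative choice, not a value from the paper:

    def attention_regime(gamma: float, tol: float = 0.02) -> str:
        # For A(d) ∝ d^(-gamma), the tail mass sum over d of d^(-gamma)
        # diverges for gamma < 1 and converges for gamma > 1, which is
        # what separates the long-range and local regimes.
        if gamma < 1 - tol:
            return "gamma < 1: slow decay, long-range attention dominates"
        if gamma > 1 + tol:
            return "gamma > 1: fast decay, attention concentrates locally"
        return "gamma ~ 1: critical regime"

    print(attention_regime(0.55))  # exponent from the sketch above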

  A controlled-θ pretraining pilot at θ ∈ {10⁴, 10⁵, 10⁶} confirms quantitative agreement (max 5.07% relative error
  vs the Padé prediction) under causal isolation. Higher-order predictions are empirically validated: the power law
  beats an exponential fit in 54/56 measurements, and per-layer γ stability holds with CV < 0.20 on 5/5 models.
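  The power-law-versus-exponential comparison can be reproduced in miniature on a single decay curve; the synthetic
  data and the log-space least-squares criterion below are assumptions, not the paper's measurement pipeline:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic attention-vs-distance curve decaying as d^(-0.8) plus noise.
    d = np.arange(1, 513, dtype=float)
    A = d ** -0.8 * np.exp(rng.normal(0.0, 0.05, d.size))

    # Power law: log A = c - gamma*log d, linear in log-log coordinates.
    slope_pow, c_pow = np.polyfit(np.log(d), np.log(A), 1)
    sse_pow = np.sum((np.log(A) - (slope_pow * np.log(d) + c_pow)) ** 2)

    # Exponential: log A = c - lambda*d, linear in semi-log coordinates.
    slope_exp, c_exp = np.polyfit(d, np.log(A), 1)
    sse_exp = np.sum((np.log(A) - (slope_exp * d + c_exp)) ** 2)

    print(f"power law   SSE = {sse_pow:.3f}, fitted gamma = {-slope_pow:.3f}")
    print(f"exponential SSE = {sse_exp:.3f}")
    print("power law fits better" if sse_pow < sse_exp else "exponential fits better")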

  A free, browser-based diagnostic tool implementing every formula is available at
  https://karlesmarin.github.io/tafagent/ (Apache-2.0). Source code and reproducibility data (343 JSON measurement
  files, 5.5 MB) are at https://github.com/karlesmarin/tafagent.

  Single-sentence position: "Attention is not learned arbitrarily; it follows a constrained scaling law that can be
  exploited for design, efficiency, and reasoning."

Files (2.0 MB)

Predicting How Transformers Attend.pdf
1.5 MB | md5:ad3788744d8aa0d0af18805f091c70cc

462.7 kB | md5:ab391b82f2cc889febb188197690e1dd

Additional details

Software

Repository URL
https://karlesmarin.github.io/tafagent
Programming language
Python