The residual growth landscape: A 43-model survey of residual stream dynamics and a post-hoc intervention study
Description
We measure the residual stream growth factor, the ratio of the last-layer residual norm to the first-layer norm, across 43 openly available language models spanning 15 architecture families and 70M to 4B parameters. Residual growth varies by over 500x across models (from 5x in OLMo-2 to 2,747x in Qwen3-1.7B) and shows no correlation with parameter count (Spearman r = 0.043, p = 0.78).
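The growth factor as defined above can be sketched in a few lines. This is a minimal illustration, not the survey's measurement code: the calibration text, norm aggregation (mean over token positions), and choice of "first-layer" state are assumptions here.

```python
import numpy as np

def residual_growth(hidden_states):
    """Ratio of the mean last-layer residual L2 norm to the mean first-layer norm.

    hidden_states: list of [seq_len, d_model] arrays, one per layer, where
    the first entry is the residual stream after layer 1 and the last entry
    is the stream after the final layer. (Aggregating by the mean over token
    positions is an assumption, not the survey's documented choice.)
    """
    first = np.linalg.norm(hidden_states[0], axis=-1).mean()
    last = np.linalg.norm(hidden_states[-1], axis=-1).mean()
    return last / first

# Toy example: residual norms double at each of 4 layer transitions,
# so the growth factor is 2**4 = 16.
states = [np.ones((8, 16)) * (2.0 ** i) for i in range(5)]
print(residual_growth(states))  # → 16.0
```

With a real model, the per-layer states could come from a forward pass that exposes hidden states, with the same ratio computed over a fixed calibration text.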
We conduct two intervention studies on 9–10 models. First, Norm Equalization (NormEq), an analytical rescaling that forces uniform residual growth, degrades perplexity in 8 of 9 cases, with catastrophic failure (+2,073%) in Qwen3-0.6B. Second, progressive layer dropping reveals that resilience to depth reduction is uncorrelated with residual growth (r = -0.09, p = 0.80): Falcon-H1 (RG = 9x) is the most fragile model, while GPT2-XL (RG = 510x) degrades gracefully.
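The record does not give the NormEq formula. One hypothetical reading of "an analytical rescaling that forces uniform residual growth" is: keep the end-to-end growth n_L / n_1 fixed, but spread it geometrically across layers, rescaling each layer's residual norm toward n_1 * g**l with g = (n_L / n_1) ** (1 / (L - 1)). The sketch below computes those per-layer scale factors; the function name and the geometric-target choice are assumptions.

```python
import numpy as np

def normeq_scales(layer_norms):
    """Per-layer rescaling factors forcing geometric (uniform) residual growth.

    layer_norms: observed mean residual norms n_1 .. n_L, one per layer.
    Returns factors s_l such that s_l * n_l = n_1 * g**l, where
    g = (n_L / n_1) ** (1 / (L - 1)). The endpooint norms are preserved,
    so the total growth factor is unchanged; only its layer-wise
    distribution is equalized. (Hypothetical reading of NormEq.)
    """
    n = np.asarray(layer_norms, dtype=float)
    L = len(n)
    g = (n[-1] / n[0]) ** (1.0 / (L - 1))
    target = n[0] * g ** np.arange(L)
    return target / n

# Example: uneven growth 1 -> 10 -> 12 -> 100 becomes a constant
# per-layer ratio of 100 ** (1/3) after rescaling.
print(normeq_scales([1.0, 10.0, 12.0, 100.0]))
```

Under this reading, the first and last layers get a scale of 1, which is consistent with the intervention changing only how growth is distributed, not the overall growth factor.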
We conclude that heterogeneous residual growth is a learned feature of Pre-LN Transformer training, not an architectural defect, and that layer-level criticality depends on architecture type rather than residual dynamics.
Residual growth values reported in the survey and intervention files were measured under different calibration and evaluation setups, so the same model may have different absolute RG values across files. Comparisons should therefore be made within a single experiment's values, not across files.
Files (1.1 MB)

| Name | Size | MD5 |
|---|---|---|
| | 68.0 kB | md5:1c97a58e2d297bf14e2ac1b2363a0108 |
| residual_growth_report_v2.pdf | 1.1 MB | md5:f89e296bc8cd85b41d5f9d31de5fb182 |
Additional details

Software
- Repository URL: https://github.com/Ono-Katsuki/residual-growth-report
- Programming language: Python