Published March 28, 2026 | Version v3

The residual growth landscape: A 43-model survey of residual stream dynamics and a post-hoc intervention study

Description

We measure the residual stream growth factor (RG), the ratio of the final-layer residual norm to the first-layer residual norm, across 43 openly available language models spanning 15 architecture families and 70M to 4B parameters. Residual growth varies by over 500x across models (from 5x in OLMo-2 to 2,747x in Qwen3-1.7B) and shows no correlation with parameter count (Spearman r = 0.043, p = 0.78).
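The growth factor itself is a simple ratio of norms. A minimal sketch in numpy, assuming per-layer residual states are already collected and norms are averaged over sequence positions (the report's exact measurement protocol, e.g. which norm and which token positions, is an assumption here):

```python
import numpy as np

def residual_growth(residuals):
    """RG: mean residual norm at the last layer over the first layer.

    `residuals` is a list of per-layer hidden-state arrays, each of shape
    (seq_len, d_model); norms are taken per position, then averaged.
    Assumed convention -- the report may average differently.
    """
    first = np.linalg.norm(residuals[0], axis=-1).mean()
    last = np.linalg.norm(residuals[-1], axis=-1).mean()
    return last / first

# Toy example: a "model" whose residual norm doubles at every layer,
# so RG over 6 layers is 2**5 = 32.
rng = np.random.default_rng(0)
h = rng.standard_normal((8, 16))
residuals = [h * (2.0 ** i) for i in range(6)]
print(residual_growth(residuals))  # -> 32.0
```

For a real model, the per-layer states would come from the model's hidden-state outputs rather than a synthetic list.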

We conduct two intervention studies on 9–10 models. First, Norm Equalization (NormEq), an analytical rescaling that forces uniform residual growth, degrades perplexity in 8 of 9 cases, with catastrophic failure (+2,073%) in Qwen3-0.6B. Second, progressive layer dropping reveals that resilience to depth reduction is uncorrelated with residual growth (r = -0.09, p = 0.80): Falcon-H1 (RG = 9x) is the most fragile model, while GPT2-XL (RG = 510x) degrades gracefully.
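One plausible instantiation of a "force uniform residual growth" rescaling is to move each layer's residual norm onto a geometric schedule between the first- and last-layer norms, which leaves the endpoints intact but equalizes layer-to-layer growth. This is a sketch of that idea only; the report's exact NormEq procedure (where the rescaling is applied, and how it interacts with normalization layers) is not specified here:

```python
import numpy as np

def norm_equalize(residuals):
    """Rescale per-layer residual states so their norms follow a uniform
    geometric schedule from the first-layer norm to the last-layer norm.

    Sketch of an analytical norm-equalizing rescaling; assumed details,
    not necessarily the report's NormEq implementation.
    """
    norms = np.array([np.linalg.norm(r) for r in residuals])
    L = len(residuals)
    # Geometric interpolation between the first and last norms.
    targets = norms[0] * (norms[-1] / norms[0]) ** (np.arange(L) / (L - 1))
    return [r * (t / n) for r, t, n in zip(residuals, targets, norms)]

# Toy stack with ragged norms; after equalization the layer-to-layer
# growth ratio is constant and the endpoint norms are unchanged.
rng = np.random.default_rng(1)
residuals = [rng.standard_normal(16) * s for s in (1.0, 9.0, 2.0, 8.0)]
eq = norm_equalize(residuals)
ratios = [np.linalg.norm(b) / np.linalg.norm(a) for a, b in zip(eq, eq[1:])]
```

Forcing this schedule onto a trained model changes intermediate activation scales the network never saw in training, which is one way to read the perplexity degradation reported above.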

We conclude that heterogeneous residual growth is a learned feature of Pre-LN Transformer training, not an architectural defect, and that layer-level criticality depends on architecture type rather than residual dynamics.

Residual growth values reported in the survey and intervention files were measured under different calibration and evaluation setups, so overlapping models may have different absolute RG values across files. Cross-file comparisons should therefore use within-experiment values.

Files (1.1 MB)

residual_growth_report_v2.pdf

md5:1c97a58e2d297bf14e2ac1b2363a0108 — 68.0 kB
md5:f89e296bc8cd85b41d5f9d31de5fb182 — 1.1 MB

Additional details

Software