Intelligence as Predictive Compression: Evidence from GPT-2 Analysis and Learned Concept Bottlenecks
Description
We present a mathematical framework connecting intelligence to predictive compression through ε-machines (minimal sufficient statistics of the past for predicting the future) and demonstrate that modern transformer language models implicitly implement this compression. Through systematic reverse-engineering of GPT-2, we reveal a three-phase "V-shape" crystallization pattern: tokens compress into ~200 predictive equivalence classes by layer 2, undergo controlled semantic disambiguation in middle layers, and recrystallize into context-specific representations by layer 11. We validate this theory by training a learned discrete bottleneck model that routes tokens through 512 concepts using Gumbel-softmax, achieving 2.3× better validation loss (1.60 vs 3.30) and producing coherent text compared to static pre-clustered baselines that collapse during training. We further compare our architecture against standard models (char-RNN, small GPT, GPT-2 124M), showing that enforced compression achieves competitive performance with 19% fewer parameters and dramatically better interpretability. Our results suggest that intelligence emerges from compression into minimal predictive representations, with practical implications for reducing training costs through enforced discrete bottlenecks.
9 pages, 3 figures, 12 tables. Code available upon request.
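Since the code is stated to be available only upon request, the following is a minimal, hypothetical PyTorch sketch of the kind of learned discrete bottleneck the abstract describes: routing each token's hidden state through one of 512 concepts via a straight-through Gumbel-softmax assignment. The module name, dimensions, temperature, and layer placement are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneck(nn.Module):
    """Hypothetical sketch of a discrete concept bottleneck.

    Each token's hidden state is assigned to one of `n_concepts` learned
    concept vectors using Gumbel-softmax; details differ from the paper's code.
    """
    def __init__(self, d_model: int = 768, n_concepts: int = 512, tau: float = 1.0):
        super().__init__()
        self.to_logits = nn.Linear(d_model, n_concepts)   # token -> concept scores
        self.concepts = nn.Embedding(n_concepts, d_model)  # learned concept codebook
        self.tau = tau

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) hidden states from a transformer block
        logits = self.to_logits(h)
        # Straight-through Gumbel-softmax: hard one-hot routing in the forward
        # pass, soft differentiable assignment in the backward pass.
        assign = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=-1)
        # Replace each token's representation with its assigned concept vector.
        return assign @ self.concepts.weight

# Usage: hidden states in, concept-quantized representations out.
bottleneck = ConceptBottleneck()
h = torch.randn(2, 16, 768)
z = bottleneck(h)  # (2, 16, 768); every row is one of the 512 concept vectors
```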
Files (733.6 kB)

| Name | Size |
|---|---|
| Intelligence_as_Predictive_Compression.pdf (md5:491fbc0a1638f006c10d1950d214dce0) | 733.6 kB |