Attention Is All We Had — But Not What We Needed: Convergent State Machine for Iterative Energy-Based Language Generation
Authors/Creators
- 1. VKD Industries Private Limited, KAEL Division — Autonomous Reasoning Infrastructure, Lucknow, India
Description
We introduce the Convergent State Machine (CSM), a novel
architecture for language generation that replaces attention
entirely with energy-based iterative state refinement.
We trained three models (66M, 150M, and 331M parameters), all with
zero attention layers. CSM 150M comes within 0.4% of GPT-2 1.5B on
MMLU while using 10x fewer parameters and 13x less data.
Key finding: with more iterations, perplexity improves by 60% on
hard problems but degrades by 60% on easy problems.
The model reasons deeper on harder problems: the extra iterations
are not repeated computation but genuine difficulty-dependent reasoning.
Iteration scaling is confirmed across all three model sizes:
66M converges at iteration 15, 150M at iteration 30, and 331M at iteration 40.
Total training compute: under $100 of A100 GPU time.
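To make the idea of energy-based iterative state refinement concrete, here is a minimal toy sketch. It is not the CSM architecture, which this description does not specify. It assumes a simple quadratic energy and plain gradient descent, and every name in it (`energy`, `refine`, `target`, `curvature`) is hypothetical. It only illustrates how a state refined until its energy gradient vanishes can take more iterations on a "harder" (worse-conditioned) input than on an easy one.

```python
from typing import List, Tuple


def energy(state: List[float], target: List[float], curvature: List[float]) -> float:
    """Toy quadratic energy: 0.5 * sum_i c_i * (s_i - t_i)^2 (an assumption, not CSM's energy)."""
    return 0.5 * sum(c * (s - t) ** 2 for s, t, c in zip(state, target, curvature))


def refine(
    target: List[float],
    curvature: List[float],
    tol: float = 1e-6,
    max_iters: int = 10_000,
) -> Tuple[List[float], int]:
    """Refine a state by gradient descent on the toy energy until it converges.

    Returns the final state and the number of update steps that were applied.
    """
    step = 1.0 / max(curvature)  # stable step size for this quadratic energy
    state = [0.0] * len(target)  # start from a zero state
    for applied in range(max_iters):
        # Gradient of the energy with respect to each state component.
        grad = [c * (s - t) for s, t, c in zip(state, target, curvature)]
        if max(abs(g) for g in grad) < tol:
            return state, applied  # converged: stop refining
        state = [s - step * g for s, g in zip(state, grad)]
    return state, max_iters


# Hypothetical "easy" input: equal curvature in every direction.
easy_state, easy_iters = refine(target=[1.0, -2.0], curvature=[1.0, 1.0])
# Hypothetical "hard" input: one nearly flat direction slows convergence.
hard_state, hard_iters = refine(target=[1.0, 1.0], curvature=[1.0, 0.05])
print(easy_iters, hard_iters)
```

In this sketch the stopping rule, not a fixed layer count, decides how much computation an input receives. That is the general property the description attributes to CSM, though the actual energy function and update rule are documented only in the paper.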
Files
CSM_Paper_2.pdf (34.6 kB)
md5:c654b234e86fe656cd7e69317c7a32bf