Published January 16, 2026 | Version v1
Preprint Open

Explaining arithmetic failure in LLMs: an architectural inability to calculate

Description

Despite their growing versatility across linguistic and analytical tasks, large language models (LLMs) consistently underperform on benchmarks requiring numerical calculation. This shortfall is often interpreted as a sign of limited reasoning capacity. In this study, we apply a metacognitive interview protocol to examine how a state-of-the-art LLM explains its own failure to calculate. The model clearly distinguishes between simulation and computation, describing its arithmetic output as the generation of plausible linguistic patterns rather than internally grounded numerical reasoning. Chain-of-thought prompting improves performance in some cases, but is shown to do so by encouraging slower, more structured token prediction rather than by enabling calculation. These findings support the view that LLMs fail at arithmetic not because they lack intelligence, but because their architecture was never designed to perform calculation. We suggest that arithmetic should no longer serve as a primary benchmark for general intelligence in LLMs, and that new evaluation frameworks are needed to better reflect their actual cognitive profile.

Files

Explaining arithmetic Combined.pdf (192.9 kB)
md5:674d8b80ee417a2837e9126764eebc58