Published March 31, 2026 | Version v4
Preprint Open

Separate and Amplify: Attention's Geometry of Retrieval

Authors/Creators

Description

Using the Tuple-Structured Associative Recall task to isolate retrieval, we demonstrate that Transformer models learn high-magnitude spherical codes (sets of vectors with a guaranteed minimum angular separation) and can achieve perfect accuracy and robust length generalization down to single-digit head dimensions. We show by construction that attention's single-head retrieval capacity $N$ approaches the representational limit of the subspaces it projects from, and is thus unbounded over the reals. Given $b$ bits per coordinate of input, capacity scales as $N \approx 2^{bd_k}$, or equivalently $N \approx 2^B$, where $B = b d_k$ is the total bit budget. Head dimension $d_k \geq 2$ does not increase capacity, but it influences how efficiently a given spherical code can approach this representational limit.
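A minimal sketch of the retrieval mechanism the abstract describes (an illustration, not the paper's code): with head dimension $d_k = 2$, place $N$ keys evenly on a circle, a 2-D spherical code with minimum angular separation $2\pi/N$, and scale them to high magnitude so that softmax attention over the dot products retrieves the matching one-hot value nearly exactly. The names, the scale value, and the choice $N = 64$ are all illustrative assumptions.

```python
import numpy as np

d_k = 2        # head dimension
N = 64         # number of stored (key, value) pairs
scale = 40.0   # high-magnitude codes sharpen the softmax (illustrative value)

# A 2-D spherical code: N unit vectors with minimum angular
# separation 2*pi/N, scaled up to high magnitude.
angles = 2 * np.pi * np.arange(N) / N
keys = scale * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (N, d_k)
values = np.eye(N)  # one-hot value ids for easy readout

def attend(query, keys, values):
    """Single-head dot-product attention: softmax(K q) @ V."""
    logits = keys @ query
    weights = np.exp(logits - logits.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ values

# Query with each stored key in turn; retrieval is correct for all
# N = 64 slots even though d_k = 2.
retrieved = [int(np.argmax(attend(k, keys, values))) for k in keys]

# Because values are one-hot, the output entries are the attention
# weights; the winning weight is close to 1 (near-hard retrieval).
top_weight = attend(keys[0], keys, values).max()
```

The key point the sketch mirrors is that capacity here is limited by how finely the key coordinates can be resolved (the bit budget $B$), not by the head dimension: shrinking the magnitude or packing keys closer than the available precision allows is what eventually breaks retrieval.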

Files

attn_capacity_paper.pdf (6.6 MB, md5:0f4b0e544933e573d4dd17d1e79bd7b5)

Additional details

Software

Repository URL
https://github.com/tmaselko/paper-attncap
Programming language
Python