Dual-Head Attention Enables Length Generalization in Transformer Multiplication
Description
Transformers fail to generalize beyond training lengths on arithmetic tasks.
We argue that the root cause is geometric: dot-product attention projects onto the subspace spanned by the training data and cannot capture structural patterns that are orthogonal to content similarity.
We introduce Dual-Head Attention, which adds Gram-Schmidt-orthogonalized sine heads alongside standard cosine heads.
On N×N integer multiplication, an 883K-parameter model trained on operands of 1 to 6 digits achieves 80.6% exact-match accuracy on unseen operands of 7 to 10 digits,
whereas a standard Transformer with identical capacity scores near zero.
The model uses no scratchpad and no task-specific positional encoding.
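The geometric idea behind pairing cosine and sine components can be illustrated with a single Gram-Schmidt step. The sketch below is an assumption about the underlying geometry only, not the layer defined in the paper or repository: the function names `gram_schmidt_step` and `cosine_and_sine` are hypothetical, and the vectors `q` and `k` stand in for a query and a key.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt_step(q, k):
    """Split k into a part parallel to q and a residual orthogonal to q."""
    scale = dot(k, q) / dot(q, q)
    parallel = [scale * a for a in q]
    residual = [b - p for b, p in zip(k, parallel)]
    return parallel, residual

def cosine_and_sine(q, k):
    """Cosine and sine of the angle between q and k.

    The cosine is the usual normalized dot product; the sine is the
    relative length of the residual left after removing the part of k
    that lies along q (hypothetical reading of a "sine" score).
    """
    _, residual = gram_schmidt_step(q, k)
    k_norm = math.sqrt(dot(k, k))
    cosine = dot(q, k) / (math.sqrt(dot(q, q)) * k_norm)
    sine = math.sqrt(dot(residual, residual)) / k_norm
    return cosine, sine

# Toy vectors chosen so the numbers are easy to check by hand.
q = [3.0, 0.0, 0.0]
k = [4.0, 3.0, 0.0]
parallel, residual = gram_schmidt_step(q, k)
cosine, sine = cosine_and_sine(q, k)
```

For these toy vectors the residual is orthogonal to `q`, and the two scores satisfy cosine² + sine² = 1, so the sine component carries exactly the part of `k` that a dot product with `q` does not see.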
Code: https://github.com/yzb3001313-star/Dual-Head-Attention-Enables-Length-Generalization
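The length split and exact-match metric described above can be sketched as follows. This is a minimal illustration, not the repository's evaluation script: the prompt format `"a*b="`, the helper names, the sampling scheme, and the example counts are all assumptions.

```python
import random

def sample_operand(num_digits, rng):
    """Uniform integer with exactly num_digits decimal digits (no leading zero)."""
    low = 10 ** (num_digits - 1)
    return rng.randint(low, 10 ** num_digits - 1)

def make_split(digit_range, n_examples, seed):
    """Build (prompt, target) pairs where both operands have the same length."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        n = rng.choice(digit_range)  # both operands share length n (N×N)
        a, b = sample_operand(n, rng), sample_operand(n, rng)
        examples.append((f"{a}*{b}=", str(a * b)))  # hypothetical prompt format
    return examples

def exact_match_accuracy(predict, examples):
    """Fraction of examples whose predicted string equals the target exactly."""
    hits = sum(predict(prompt) == target for prompt, target in examples)
    return hits / len(examples)

def oracle(prompt):
    """Stand-in for a trained model: computes the product directly."""
    a, b = prompt.rstrip("=").split("*")
    return str(int(a) * int(b))

train_set = make_split(range(1, 7), n_examples=50, seed=0)   # 1 to 6 digits
test_set = make_split(range(7, 11), n_examples=50, seed=1)   # 7 to 10 digits
oracle_accuracy = exact_match_accuracy(oracle, test_set)
```

Exact match counts a prediction as correct only if every digit of the product is right, so a model's `predict` function would replace `oracle` here.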
Files
Dual-Head Attention Enables Length Generalization_V03.pdf
Total size: 239.4 kB
| MD5 | Size |
|---|---|
| 1cad310cbc548bb498923c6899b3d385 | 197.4 kB |
| 3b26131f63dd9cb1f8b4c5fd5511fb91 | 42.0 kB |
Additional details
Software
- Repository URL: https://github.com/yzb3001313-star/Dual-Head-Attention-Enables-Length-Generalization
- Programming language: Python