Dual-Head Attention Enables Length Generalization in Transformer Multiplication

Yan, Tianshi

doi:10.5281/zenodo.20374174

Published May 25, 2026 | Version 4.0

Preprint | Open access

Yan, Tianshi¹

1. Independent Researcher

Transformers fail to generalize beyond training lengths on arithmetic tasks.

We argue the root cause is geometric: dot-product attention projects onto the subspace spanned by training data, and cannot capture structural patterns that are orthogonal to content similarity.

We introduce Dual-Head Attention, which adds Gram-Schmidt-orthogonalized sine heads alongside standard cosine heads.
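As a rough illustration of the orthogonalization step named above, the sketch below applies a Gram-Schmidt projection to remove from one vector its component along another. The abstract does not say how the sine and cosine heads are constructed, so `cos_like`, `sin_like`, the frequency `0.3`, and the helper `orthogonalize_against` are hypothetical placeholders rather than the author's implementation.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize_against(v, basis, eps=1e-12):
    """Gram-Schmidt step: subtract from v its projection onto each basis vector.

    Assumes the vectors in `basis` are already mutually orthogonal.
    """
    out = list(v)
    for b in basis:
        denom = dot(b, b)
        if denom < eps:
            continue  # skip (near-)zero basis vectors
        coef = dot(out, b) / denom
        out = [x - coef * y for x, y in zip(out, b)]
    return out


# Placeholder vectors (an assumption): cosine and sine sampled at 8 positions.
cos_like = [math.cos(0.3 * i) for i in range(8)]
sin_like = [math.sin(0.3 * i) for i in range(8)]

# After the projection, the sine-like vector carries no component along
# the cosine-like one, up to floating-point error.
sin_perp = orthogonalize_against(sin_like, [cos_like])
residual_overlap = dot(sin_perp, cos_like)
```

The point of the sketch is only that the orthogonalized vector can encode information the original direction cannot, which is the property the abstract attributes to the added heads.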

On N×N integer multiplication, an 883K-parameter model trained on 1-6 digit operands achieves 80.6% exact-match accuracy on unseen 7-10 digit operands, whereas a standard Transformer of identical capacity scores near zero.
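To make the evaluation protocol concrete, here is a minimal sketch of exact-match accuracy measured separately on in-distribution (1-6 digit) and held-out (7-10 digit) operand lengths. It assumes both operands have the same number of digits and that a prediction counts only if the full product string matches; `sample_operand`, `exact_match_accuracy`, and the stand-in `capped_model` are hypothetical names, not part of the released code.

```python
import random


def sample_operand(num_digits, rng):
    """Draw an integer with exactly num_digits decimal digits."""
    low = 10 ** (num_digits - 1) if num_digits > 1 else 0
    high = 10 ** num_digits - 1
    return rng.randint(low, high)


def exact_match_accuracy(predict, digit_lengths, samples_per_length=100, seed=0):
    """Fraction of problems where predict(a, b) equals str(a * b) exactly."""
    rng = random.Random(seed)
    correct = total = 0
    for n in digit_lengths:
        for _ in range(samples_per_length):
            a, b = sample_operand(n, rng), sample_operand(n, rng)
            if predict(a, b) == str(a * b):
                correct += 1
            total += 1
    return correct / total


def capped_model(a, b):
    """Stand-in predictor (an assumption): can only emit the low 12 digits."""
    return str((a * b) % 10 ** 12)


# Products of 1-6 digit operands stay below 10**12, so the stand-in is
# always right in-distribution and always wrong on 7-10 digit operands.
train_range_acc = exact_match_accuracy(capped_model, range(1, 7))
held_out_acc = exact_match_accuracy(capped_model, range(7, 11))
```

The stand-in mimics the failure pattern described for the baseline: perfect within the training lengths, zero beyond them. A real evaluation would replace `capped_model` with a function that decodes the trained model's output.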

The model uses no scratchpad and no task-specific positional encoding.

Code: https://github.com/yzb3001313-star/Dual-Head-Attention-Enables-Length-Generalization

Files

Files (355.4 kB)

Name	MD5	Size
Dual-Head Attention Enables Length Generalization_V04.pdf	3239ec2e3d823ac81ee81eaa62c00cd7	313.4 kB
V02_方案四_对偶头分叉.py	3b26131f63dd9cb1f8b4c5fd5511fb91	42.0 kB

Additional details

Is new version of: Preprint: 10.5281/zenodo.20368685 (DOI)

Repository URL: https://github.com/yzb3001313-star/Dual-Head-Attention-Enables-Length-Generalization
Programming language: Python

	All versions	This version
Views	135	12
Downloads	50	6
Data volume	13.8 MB	2.8 MB
