There is a newer version of the record available.

Published May 24, 2026 | Version 3.0
Preprint Open

Dual-Head Attention Enables Length Generalization in Transformer Multiplication

Authors/Creators

  • 1. Independent Researcher

Description

Transformers fail to generalize beyond training lengths on arithmetic tasks.

We argue the root cause is geometric: dot-product attention projects onto the subspace spanned by training data, and cannot capture structural patterns that are orthogonal to content similarity.

We introduce Dual-Head Attention, which adds Gram-Schmidt-orthogonalized sine heads alongside standard cosine heads. 
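
One possible geometric reading of the "cosine" and "sine" terminology, offered only as an illustration and not as the paper's implementation: a key vector can be split by a single Gram-Schmidt step into a part parallel to the query and a residual orthogonal to it. The parallel part gives the usual cosine similarity, and the relative norm of the residual equals the sine of the angle between the two vectors. The function name `cosine_and_sine_scores` and the example vectors below are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_and_sine_scores(q, k):
    """Split key k into parts parallel and orthogonal to query q.

    Illustrative only (an assumption, not the paper's code): returns
    (cos_score, sin_score), where cos_score is the cosine similarity and
    sin_score is the relative norm of the Gram-Schmidt residual of k.
    """
    q_hat = [x / norm(q) for x in q]                     # unit vector along q
    proj = dot(k, q_hat)                                 # signed length of k along q
    k_perp = [x - proj * u for x, u in zip(k, q_hat)]    # Gram-Schmidt residual
    cos_score = proj / norm(k)
    sin_score = norm(k_perp) / norm(k)
    return cos_score, sin_score

# Hypothetical 2-D example: k sits at 45 degrees from q.
q = [1.0, 0.0]
k = [1.0, 1.0]
cos_s, sin_s = cosine_and_sine_scores(q, k)
```

By construction the two scores satisfy cos² + sin² = 1, so the residual score carries exactly the part of the key that the cosine score does not see.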

On N×N integer multiplication, an 883K-parameter model trained on 1-6 digit operands achieves 80.6% exact-match accuracy on unseen 7-10 digit operands, whereas a standard Transformer with identical capacity scores near zero.

The model uses no scratchpad and no task-specific positional encoding.
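
The length-generalization protocol described above (train on 1-6 digit operands, test on 7-10 digit operands, score by exact match) can be sketched as follows. This is a minimal sketch under stated assumptions: both operands in a pair are assumed to have the same digit count, and the helpers `make_pairs`, `exact_match`, `oracle`, and `truncating` are hypothetical stand-ins, not functions from the linked repository.

```python
import random

def make_pairs(min_digits, max_digits, n, rng):
    """Sample n operand pairs; both operands share a digit count d
    drawn uniformly from [min_digits, max_digits] (an assumption)."""
    pairs = []
    for _ in range(n):
        d = rng.randint(min_digits, max_digits)
        lo, hi = 10 ** (d - 1), 10 ** d - 1   # smallest and largest d-digit integers
        pairs.append((rng.randint(lo, hi), rng.randint(lo, hi)))
    return pairs

def exact_match(predict, pairs):
    """Fraction of pairs whose predicted string equals the true product exactly."""
    correct = sum(predict(a, b) == str(a * b) for a, b in pairs)
    return correct / len(pairs)

rng = random.Random(0)
train_pairs = make_pairs(1, 6, 1000, rng)    # lengths seen in training
test_pairs = make_pairs(7, 10, 1000, rng)    # longer, unseen lengths

# Hypothetical stand-in "models" used only to exercise the metric.
def oracle(a, b):
    return str(a * b)

def truncating(a, b):
    # Correct up to 6-digit operands, drops the last digit beyond that.
    s = str(a * b)
    return s if len(str(a)) <= 6 else s[:-1]

in_dist_acc = exact_match(truncating, train_pairs)
out_dist_acc = exact_match(truncating, test_pairs)
```

Exact match gives no partial credit: a single wrong digit counts the whole product as incorrect, which is why a model that fails only on longer operands can drop from perfect in-distribution accuracy to zero on the unseen lengths.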

 Code: https://github.com/yzb3001313-star/Dual-Head-Attention-Enables-Length-Generalization

Files

Dual-Head Attention Enables Length Generalization_V03.pdf

Files (239.4 kB)

Name Size
md5:1cad310cbc548bb498923c6899b3d385 197.4 kB
md5:3b26131f63dd9cb1f8b4c5fd5511fb91 42.0 kB

Additional details