Transformers without Tears: Improving the Normalization of Self-Attention

doi:10.5281/zenodo.3525484

Published November 2, 2019 | Version v1

Conference paper | Open access

1. University of Notre Dame
2. Amazon AWS AI

We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PRENORM) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose ℓ2 normalization with a single scale parameter (SCALENORM) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FIXNORM). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT '15 English-Vietnamese. We observe sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT '14 English-German), SCALENORM and FIXNORM remain competitive but PRENORM degrades performance.
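The three changes named in the abstract can be illustrated with a minimal sketch in plain Python. This is one reading of the abstract, not the authors' implementation: the function names, the `eps` guard against division by zero, the use of a fixed float `g` in place of a learned scale parameter, and the post-norm variant shown for contrast are all assumptions made for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]


def l2_norm(x: Vector) -> float:
    """Euclidean (l2) length of a vector."""
    return math.sqrt(sum(v * v for v in x))


def scale_norm(x: Vector, g: float, eps: float = 1e-5) -> Vector:
    """SCALENORM-style step: rescale x to length g, i.e. g * x / ||x||.

    `g` stands in for the single scale parameter; `eps` is an assumed
    safeguard against division by zero.
    """
    norm = max(l2_norm(x), eps)
    return [g * v / norm for v in x]


def fix_norm(embedding: Vector, length: float = 1.0, eps: float = 1e-5) -> Vector:
    """FIXNORM-style step: force a word embedding to a fixed length."""
    norm = max(l2_norm(embedding), eps)
    return [length * v / norm for v in embedding]


def pre_norm_residual(x: Vector, sublayer: Callable[[Vector], Vector], g: float) -> Vector:
    """PRENORM-style residual: normalize the sublayer input, then add x back."""
    return [a + b for a, b in zip(x, sublayer(scale_norm(x, g)))]


def post_norm_residual(x: Vector, sublayer: Callable[[Vector], Vector], g: float) -> Vector:
    """Post-norm residual, shown only for contrast (an assumption here):
    add first, then normalize the sum."""
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return scale_norm(summed, g)


if __name__ == "__main__":
    x = [3.0, 4.0]
    identity = lambda v: v  # hypothetical stand-in for attention or feed-forward
    print(scale_norm(x, g=2.0))
    print(pre_norm_residual(x, identity, g=1.0))
    print(post_norm_residual(x, identity, g=1.0))
```

In this sketch the pre-norm path leaves the residual stream `x` unnormalized and only normalizes what the sublayer sees, while the post-norm path normalizes the sum, so its output always has length `g`.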

Files (345.9 kB)

Name	Size
IWSLT2019_paper_26.pdf (md5:5a6c18ef21719ddacdab79deec9a4b39)	345.9 kB

	All versions	This version
Views	1,230	1,223
Downloads	754	752
Data volume	297.2 MB	296.5 MB
