Published February 1, 2020 | Version v1
Conference paper | Open Access

Towards making the most of BERT in neural machine translation

Creators

  • ByteDance

Description

Can we exploit extremely large monolingual corpora to improve neural machine translation without resorting to expensive back-translation? Neural machine translation models are trained on parallel bilingual corpora, and even the large ones contain only 20 to 40 million sentence pairs. Pre-trained language models such as BERT and GPT, by contrast, are typically trained on billions of monolingual sentences. Simply using BERT to initialize the Transformer encoder brings no benefit, because the BERT knowledge is catastrophically forgotten during further training on MT data. This example shows how to run CTNMT (Yang et al. 2020), the first training method to successfully integrate BERT into a Transformer MT model.
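
As a rough illustration of the kind of integration the description refers to, the sketch below blends the output of a frozen BERT encoder into a trainable NMT encoder state through a learned gate. This is a minimal PyTorch sketch under assumed names and shapes (GatedBertFusion, hidden size 768, random tensors standing in for real encoder outputs); it is not the released CTNMT implementation.

    # Hypothetical sketch: gated fusion of frozen BERT features with an NMT encoder.
    # Names, dimensions, and the exact fusion rule are illustrative assumptions.
    import torch
    import torch.nn as nn

    class GatedBertFusion(nn.Module):
        """Blend a frozen BERT representation into the NMT encoder state."""

        def __init__(self, hidden_size: int = 768):
            super().__init__()
            # The gate decides, per position and channel, how much BERT
            # knowledge to mix in, so it is not simply overwritten
            # (forgotten) during further training on MT data.
            self.gate = nn.Linear(2 * hidden_size, hidden_size)

        def forward(self, bert_states: torch.Tensor, nmt_states: torch.Tensor) -> torch.Tensor:
            # bert_states, nmt_states: [batch, seq_len, hidden_size]
            g = torch.sigmoid(self.gate(torch.cat([bert_states, nmt_states], dim=-1)))
            return g * bert_states + (1.0 - g) * nmt_states

    # Usage with random tensors standing in for real encoder outputs.
    fusion = GatedBertFusion(hidden_size=768)
    bert_out = torch.randn(2, 16, 768)  # frozen BERT output (kept fixed in practice)
    nmt_out = torch.randn(2, 16, 768)   # trainable Transformer encoder output
    fused = fusion(bert_out, nmt_out)   # [2, 16, 768]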

Files (2.8 GB)

ckpt.ctnmt.zip
2.8 GB
md5:34ba4f4ddd4de8db88ec0f30e5769a5a
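
To confirm that the archive downloaded intact, its checksum can be compared against the md5 listed above. The short Python sketch below assumes the file is in the current working directory under the name ckpt.ctnmt.zip.

    # Verify the downloaded checkpoint against the published md5 digest.
    import hashlib

    EXPECTED_MD5 = "34ba4f4ddd4de8db88ec0f30e5769a5a"

    def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
        """Compute the md5 hex digest of a file, reading it in chunks."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    assert md5sum("ckpt.ctnmt.zip") == EXPECTED_MD5, "checksum mismatch"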