Conference paper Closed Access

ESPnet How2 Speech Translation System for IWSLT 2019: Pre-training, Knowledge Distillation, and Going Deeper

Inaguma, Hirofumi; Kiyono, Shun; Soplin, Nelson Enrique Yalta; Suzuki, Jun; Duh, Kevin; Watanabe, Shinji


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.3525560</identifier>
  <creators>
    <creator>
      <creatorName>Inaguma, Hirofumi</creatorName>
      <givenName>Hirofumi</givenName>
      <familyName>Inaguma</familyName>
      <affiliation>Kyoto University</affiliation>
    </creator>
    <creator>
      <creatorName>Kiyono, Shun</creatorName>
      <givenName>Shun</givenName>
      <familyName>Kiyono</familyName>
      <affiliation>RIKEN AIP &amp; Tohoku University</affiliation>
    </creator>
    <creator>
      <creatorName>Soplin, Nelson Enrique Yalta</creatorName>
      <givenName>Nelson Enrique Yalta</givenName>
      <familyName>Soplin</familyName>
      <affiliation>Waseda University</affiliation>
    </creator>
    <creator>
      <creatorName>Suzuki, Jun</creatorName>
      <givenName>Jun</givenName>
      <familyName>Suzuki</familyName>
      <affiliation>Tohoku University &amp; RIKEN AIP</affiliation>
    </creator>
    <creator>
      <creatorName>Duh, Kevin</creatorName>
      <givenName>Kevin</givenName>
      <familyName>Duh</familyName>
      <affiliation>Johns Hopkins University</affiliation>
    </creator>
    <creator>
      <creatorName>Watanabe, Shinji</creatorName>
      <givenName>Shinji</givenName>
      <familyName>Watanabe</familyName>
      <affiliation>Johns Hopkins University</affiliation>
    </creator>
  </creators>
  <titles>
    <title>ESPnet How2 Speech Translation System for IWSLT 2019: Pre-training, Knowledge Distillation, and Going Deeper</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2019</publicationYear>
  <dates>
    <date dateType="Issued">2019-11-02</date>
  </dates>
  <language>en</language>
  <resourceType resourceTypeGeneral="Text">Conference paper</resourceType>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/3525560</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.3525559</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsPartOf">https://zenodo.org/communities/iwslt2019</relatedIdentifier>
  </relatedIdentifiers>
  <rightsList>
    <rights rightsURI="info:eu-repo/semantics/closedAccess">Closed Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;p&gt;This paper describes the ESPnet submissions to the How2 Speech Translation task at IWSLT 2019. This year, we build our systems mainly on Transformer architectures in all tasks and focus on end-to-end speech translation (E2E-ST). We first compare RNN-based models and Transformer models, and confirm that Transformer models significantly and consistently outperform RNN models across all tasks and corpora. Next, we investigate pre-training of E2E-ST models with the ASR and MT tasks. On top of the pre-training, we further explore knowledge distillation from the NMT model and a deeper speech encoder, and confirm drastic improvements over the baseline model. All of our code is publicly available in ESPnet.&lt;/p&gt;</description>
  </descriptions>
</resource>
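
The abstract above mentions knowledge distillation from an NMT model into the end-to-end ST system. As a rough illustration only, the following is a minimal sketch of a token-level distillation loss in PyTorch (the framework underlying ESPnet); the function name, tensor shapes, temperature, and interpolation weight alpha are hypothetical assumptions and do not reproduce the authors' actual implementation, which is available in the ESPnet repository.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, pad_id=0,
            temperature=1.0, alpha=0.5):
    """Interpolate cross-entropy on the reference translation with a KL term
    toward a pre-trained NMT teacher's output distribution (token-level KD).
    Shapes: logits are (batch, length, vocab); targets are (batch, length)."""
    vocab = student_logits.size(-1)
    student_logits = student_logits.view(-1, vocab)
    teacher_logits = teacher_logits.view(-1, vocab)
    targets = targets.view(-1)

    # Standard cross-entropy against the gold target tokens.
    ce = F.cross_entropy(student_logits, targets, ignore_index=pad_id)

    # KL divergence between the student and teacher distributions,
    # computed per token with an optional softmax temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)

    # Mask out padding positions before averaging over real tokens.
    mask = targets.ne(pad_id).float()
    kl = (kl * mask).sum() / mask.sum()

    return (1.0 - alpha) * ce + alpha * (temperature ** 2) * kl
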