Pretrained Joint Transformer-HiFiGAN MIDI-to-audio Piano Synthesis Model
- 1. University of Southern California
- 2. National Institute of Informatics
Description
This is the pretrained model for our paper submitted to ICASSP 2023:
"CAN KNOWLEDGE OF END-TO-END TEXT-TO-SPEECH MODELS IMPROVE NEURAL MIDI-TO-AUDIO SYNTHESIS SYSTEMS?"
Xuan Shi, Erica Cooper, Xin Wang, Junichi Yamagishi, Shrikanth Narayanan
https://arxiv.org/abs/2211.13868
Please cite this paper if you use this pretrained model.
This pretrained model goes with the code found here:
https://github.com/nii-yamagishilab/midi-to-audio
See that codebase's README for more information about dependencies etc.
The code for training this model was based on the ESPnet-TTS project:
"ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit," ICASSP 2020
Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan
This model was trained on data from the MAESTRO dataset:
"Enabling factorized piano music modeling and generation with the MAESTRO dataset," ICLR 2019
Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck
This model consists of a MIDI-to-mel component based on Transformer-TTS:
"Neural speech synthesis with transformer network," AAAI 2019
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu
and a HiFiGAN-based mel-to-audio component:
"HiFi-GAN: Generative Adversarial Networks for Efficient and High
Fidelity Speech Synthesis," NIPS 2020
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae
The two components were first trained separately and then jointly fine-tuned for an additional 200K steps.
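For orientation, below is a minimal, self-contained PyTorch sketch of this two-stage pipeline (MIDI frames -> mel-spectrogram -> waveform). The class and function names (`MidiToMel`, `MelToAudio`, `synthesize`) and all dimensions are illustrative placeholders, not the actual API or configuration of the midi-to-audio codebase; please see the repository README for the real training and inference scripts.

```python
# Conceptual sketch only: a toy Transformer "acoustic model" followed by a toy
# "vocoder", mirroring the two-stage design described above. Not the real model.
import torch
import torch.nn as nn


class MidiToMel(nn.Module):
    """Stand-in for the Transformer-TTS-style MIDI-to-mel component."""

    def __init__(self, midi_dim: int = 128, mel_dim: int = 80, d_model: int = 256):
        super().__init__()
        self.input_proj = nn.Linear(midi_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.output_proj = nn.Linear(d_model, mel_dim)

    def forward(self, piano_roll: torch.Tensor) -> torch.Tensor:
        # piano_roll: (batch, frames, 128 MIDI pitches)
        h = self.encoder(self.input_proj(piano_roll))
        return self.output_proj(h)  # (batch, frames, mel_dim)


class MelToAudio(nn.Module):
    """Stand-in for the HiFi-GAN-style mel-to-audio component."""

    def __init__(self, mel_dim: int = 80, hop_length: int = 256):
        super().__init__()
        # A single transposed convolution as a placeholder for the real stack
        # of upsampling and multi-receptive-field fusion blocks in HiFi-GAN.
        self.upsample = nn.ConvTranspose1d(
            mel_dim, 1, kernel_size=hop_length * 2,
            stride=hop_length, padding=hop_length // 2,
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, mel_dim) -> waveform: (batch, samples)
        return self.upsample(mel.transpose(1, 2)).squeeze(1)


@torch.no_grad()
def synthesize(piano_roll: torch.Tensor) -> torch.Tensor:
    """Run the two stages end to end, as in the jointly fine-tuned model."""
    acoustic_model = MidiToMel().eval()
    vocoder = MelToAudio().eval()
    mel = acoustic_model(piano_roll)  # stage 1: MIDI frames -> mel-spectrogram
    return vocoder(mel)               # stage 2: mel-spectrogram -> audio


if __name__ == "__main__":
    # One batch of 100 frames of a binary piano roll (128 MIDI pitches).
    dummy_roll = torch.zeros(1, 100, 128)
    dummy_roll[0, :, 60] = 1.0        # a sustained middle C
    audio = synthesize(dummy_roll)
    print(audio.shape)                # torch.Size([1, 25600]) with hop_length=256
```

In the actual system, the two stages are first optimized independently and then fine-tuned together so that the vocoder sees the acoustic model's predicted mel-spectrograms rather than ground-truth features.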
COPYING
This pretrained model is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license:
http://creativecommons.org/licenses/by/4.0/legalcode
Please see `LICENSE.txt` for the terms and conditions of this pretrained model.
ACKNOWLEDGMENTS
This study was supported by the Japanese-French joint national project VoicePersonae, JST CREST grants (JPMJCR18A6, JPMJCR20D3), MEXT KAKENHI grants (21K17775, 21H04906, 21K11951), Japan, and the Google AI for Japan program.