Pretrained Joint Transformer-HiFiGAN MIDI-to-audio Piano Synthesis Model
- 1. University of Southern California
- 2. National Institute of Informatics
Description
This is the pretrained model for our paper submitted to ICASSP 2023:
"CAN KNOWLEDGE OF END-TO-END TEXT-TO-SPEECH MODELS IMPROVE NEURAL MIDI-TO-AUDIO SYNTHESIS SYSTEMS?"
Xuan Shi, Erica Cooper, Xin Wang, Junichi Yamagishi, Shrikanth Narayanan
https://arxiv.org/abs/2211.13868
Please cite this paper if you use this pretrained model.
This pretrained model goes with the code found here:
https://github.com/nii-yamagishilab/midi-to-audio
See that codebase's README for more information about dependencies etc.
The code for training this model was based on the ESPnet-TTS project:
"ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit," ICASSP 2020
Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan
This model was trained on data from the MAESTRO dataset:
"Enabling factorized piano music modeling and generation with the MAESTRO dataset," ICLR 2019
Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck
This model consists of a MIDI-to-mel component based on Transformer-TTS:
"Neural speech synthesis with transformer network," AAAI 2019
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu
and a HiFiGAN-based mel-to-audio component:
"HiFi-GAN: Generative Adversarial Networks for Efficient and High
Fidelity Speech Synthesis," NIPS 2020
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae
The two components were first trained separately and then jointly fine-tuned for an additional 200K steps.
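For orientation, below is a minimal, self-contained PyTorch sketch of this two-stage pipeline (MIDI frames -> mel-spectrogram -> waveform). The class and function names (`MidiToMel`, `MelToAudio`, `synthesize`) and all dimensions are illustrative placeholders, not the actual API or configuration of the midi-to-audio codebase; please see the repository README for the real training and inference scripts.

```python
# Conceptual sketch only: a toy Transformer "acoustic model" followed by a toy
# "vocoder", mirroring the two-stage design described above. Not the real model.
import torch
import torch.nn as nn


class MidiToMel(nn.Module):
    """Stand-in for the Transformer-TTS-style MIDI-to-mel component."""

    def __init__(self, midi_dim: int = 128, mel_dim: int = 80, d_model: int = 256):
        super().__init__()
        self.input_proj = nn.Linear(midi_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.output_proj = nn.Linear(d_model, mel_dim)

    def forward(self, piano_roll: torch.Tensor) -> torch.Tensor:
        # piano_roll: (batch, frames, 128 MIDI pitches)
        h = self.encoder(self.input_proj(piano_roll))
        return self.output_proj(h)  # (batch, frames, mel_dim)


class MelToAudio(nn.Module):
    """Stand-in for the HiFi-GAN-style mel-to-audio component."""

    def __init__(self, mel_dim: int = 80, hop_length: int = 256):
        super().__init__()
        # A single transposed convolution as a placeholder for the real stack
        # of upsampling and multi-receptive-field fusion blocks in HiFi-GAN.
        self.upsample = nn.ConvTranspose1d(
            mel_dim, 1, kernel_size=hop_length * 2,
            stride=hop_length, padding=hop_length // 2,
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, mel_dim) -> waveform: (batch, samples)
        return self.upsample(mel.transpose(1, 2)).squeeze(1)


@torch.no_grad()
def synthesize(piano_roll: torch.Tensor) -> torch.Tensor:
    """Run the two stages end to end, as in the jointly fine-tuned model."""
    acoustic_model = MidiToMel().eval()
    vocoder = MelToAudio().eval()
    mel = acoustic_model(piano_roll)  # stage 1: MIDI frames -> mel-spectrogram
    return vocoder(mel)               # stage 2: mel-spectrogram -> audio


if __name__ == "__main__":
    # One batch of 100 frames of a binary piano roll (128 MIDI pitches).
    dummy_roll = torch.zeros(1, 100, 128)
    dummy_roll[0, :, 60] = 1.0        # a sustained middle C
    audio = synthesize(dummy_roll)
    print(audio.shape)                # torch.Size([1, 25600]) with hop_length=256
```

In the actual system, the two stages are first optimized independently and then fine-tuned together so that the vocoder sees the acoustic model's predicted mel-spectrograms rather than ground-truth features.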
COPYING
This pretrained model is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license:
http://creativecommons.org/licenses/by/4.0/legalcode
Please see `LICENSE.txt` for the terms and conditions of this pretrained model.
ACKNOWLEDGMENTS
This study was supported by the Japanese-French joint national project VoicePersonae, JST CREST grants (JPMJCR18A6, JPMJCR20D3), MEXT KAKENHI grants (21K17775, 21H04906, 21K11951), Japan, and the Google AI for Japan program.