MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling

Jingjing Tang; Xin Wang; Zhe Zhang; Junichi Yamagishi; Geraint Wiggins; George Fazekas

doi:10.5281/zenodo.17706539

There is a newer version of the record available.

Published September 21, 2025 | Version v1

Conference paper | Open access

Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, these systems often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI representation. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model's generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Fréchet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.
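The listening-test figures quoted above (202 votes for MIDI-VALLE, 58 for the baseline) can be turned into a vote share with a few lines of arithmetic. The sketch below also runs an exact two-sided sign test against a 50/50 null. That test is an illustrative assumption: it treats every vote as an independent pairwise preference with no ties, and it is not necessarily the analysis used in the paper. The function names `vote_share` and `sign_test_p_value` are hypothetical helpers introduced here.

```python
from math import comb


def vote_share(votes_a: int, votes_b: int) -> float:
    """Return the fraction of all votes that went to system A."""
    total = votes_a + votes_b
    if total == 0:
        raise ValueError("no votes recorded")
    return votes_a / total


def sign_test_p_value(votes_a: int, votes_b: int) -> float:
    """Exact two-sided binomial (sign) test against a 50/50 null.

    Assumption: each vote is an independent preference between the two
    systems. The abstract does not describe the test design in detail.
    """
    n = votes_a + votes_b
    k = max(votes_a, votes_b)
    # Count outcomes with at least k successes out of n, using exact integers.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1))
    # Double the one-sided tail and cap at 1 for the two-sided p-value.
    return min(1.0, 2 * upper_tail / 2 ** n)


# Vote counts taken from the abstract.
share = vote_share(202, 58)
p_value = sign_test_p_value(202, 58)
print(f"MIDI-VALLE vote share: {share:.1%}")
print(f"two-sided sign test p-value: {p_value:.3g}")
```

With the counts from the abstract, the share works out to 202 / 260, or about 77.7% of the votes cast.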

Files (595.3 kB)

Name	Size
000072.pdf (md5:f0389094abd6caadd0d6c2d5d6b85eb8)	595.3 kB

	All versions	This version
Views	181	98
Downloads	99	74
Data volume	61.3 MB	45.8 MB

Resource type: Conference paper

Publisher: ISMIR

Imprint: Proceedings of the 26th International Society for Music Information Retrieval Conference, 637-644. Daejeon, South Korea.

Conference: International Society for Music Information Retrieval Conference (ISMIR 2025), Daejeon, South Korea and Online, September 21-25, 2025

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: November 25, 2025
Modified: November 25, 2025
