Published November 3, 2025 | Version v1
Conference paper · Open Access

Fine-Grained MIDI Expression Transcription from Wind and String Instrument Audio via Sim2Real Transfer Learning

Description

While MIDI velocity estimation in piano music transcription has been widely studied, comparable work for other instruments remains underexplored. Unlike piano MIDI velocity, which specifies a single volume level per note, MIDI Expression (CC11) provides continuous volume modulation across a note's duration, requiring finer temporal resolution. This paper addresses the task of estimating MIDI CC11 values from wind and string instrument audio recordings. To identify suitable estimation methods, we first investigate the numerical relationship between MIDI CC11 and audio Root Mean Square (RMS) energy. Motivated by this analysis, we compare three estimation approaches: linear regression, quadratic regression, and a BiLSTM-based deep learning model. We adopt a Simulation-to-Reality (Sim2Real) strategy, training models on synthetic audio rendered from randomized MIDI CC11 curves and evaluating them on real performance recordings. Unlike approaches that require manually labeled data, ours relies entirely on synthetic training, avoiding the need for expert annotation. Experiments on violin, viola, flute, and trumpet demonstrate the effectiveness of the Sim2Real approach, with the deep learning model achieving the best performance. Using this model, we generate a MIDI dataset enriched with fine-grained CC11 annotations, which can support future expressive music analysis, modeling, and generation. All transcribed data are available online.
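The regression baselines described above can be illustrated with a minimal sketch. This is not the paper's implementation: the synthetic signal, the `frame_rms` helper, and the frame parameters below are all assumptions made for demonstration. The sketch renders a toy tone whose amplitude follows a randomized CC11-like curve, computes frame-wise RMS energy, and fits linear and quadratic least-squares mappings from RMS back to the CC11 values.

```python
import numpy as np

def frame_rms(audio, frame_len=1024, hop=512):
    """Frame-wise RMS energy of a mono signal (hypothetical helper)."""
    n = 1 + max(0, len(audio) - frame_len) // hop
    return np.array([
        np.sqrt(np.mean(audio[i * hop : i * hop + frame_len] ** 2))
        for i in range(n)
    ])

# Toy stand-in for synthetic rendering: a sine tone whose amplitude
# envelope tracks a randomized CC11-like curve (values in 0..127).
rng = np.random.default_rng(0)
sr, dur = 16000, 2.0
t = np.arange(int(sr * dur)) / sr
cc11 = np.clip(np.cumsum(rng.normal(0.0, 0.5, t.size)) / 50 + 80, 0, 127)
audio = (cc11 / 127.0) * np.sin(2 * np.pi * 440 * t)

rms = frame_rms(audio)
cc_frames = cc11[::512][: rms.size]  # one CC11 target per frame hop

# Linear and quadratic least-squares fits from RMS to CC11.
lin = np.polyfit(rms, cc_frames, 1)
quad = np.polyfit(rms, cc_frames, 2)
err_lin = np.mean((np.polyval(lin, rms) - cc_frames) ** 2)
err_quad = np.mean((np.polyval(quad, rms) - cc_frames) ** 2)
```

Because the quadratic model nests the linear one, its least-squares error can only be equal or lower on the fitting data; the paper's BiLSTM model additionally exploits temporal context across frames, which pointwise regressions cannot.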

Files

CMMR2025_P1_13.pdf (2.3 MB)
md5:4e2f466b48c32f5e37ed9fe78c47fca9