Audio Files for "Characterizing Sustained Phonation in Text-To-Speech Models" Daum et al. (2026)

Daum, Amelie

doi:10.5281/zenodo.19339995

Published June 14, 2026 | Version v1

Video/Audio Open

Daum, Amelie (Data collector)

See the associated publication: https://doi.org/10.1016/j.jvoice.2026.05.021

Sustained phonation (SP) is a central task in clinical voice assessment and provides a controlled setting to quantify acoustic voice characteristics. In contrast, the evaluation of modern text-to-speech (TTS) systems still relies predominantly on perceptual ratings such as the mean opinion score, leaving open whether these systems can reliably generate SP and how their acoustic properties compare to human voices. The capability of TTS models to reproduce clinically relevant voice features remains insufficiently characterized.

Here, we systematically examine SP in contemporary TTS systems and compare synthetic and human voice samples using common acoustic measures. Multiple TTS models were screened for their ability to generate sustained vowels, such as /a/. One model, namely Eleven v3 by ElevenLabs, was subsequently analyzed in detail with respect to the distribution of phonation durations, the relationship between prompt length and generated duration, and differences between vowels and speaker types. Finally, TTS-generated SPs were compared with human recordings from two independent cohorts using established clinical voice parameters.

We found that TTS systems were able to produce SP, although reliability varied between models. For the selected Eleven v3 model, phonation durations showed non-normal distributions and were partially predicted by prompt length. Most acoustic measures of synthetic samples overlapped with the ranges observed in human voices, while selected parameters showed statistically significant but inconsistent differences across vowels. These findings indicate that current TTS models can approximate key acoustic characteristics of SP, while also exhibiting systematic deviations that should be considered in applications involving clinical voice metrics and in further development of realistic TTS systems.

Files (206.2 MB)

Name	MD5	Size
experiment1.zip	81c8a726d65c1452345098cec8360540	29.7 MB
experiment2.zip	507668f881d86fcd91e4a5cac3b55df6	114.5 MB
experiment3.zip	14e3a977c1fc675ee9fc87b27f828157	44.6 MB
experiment4.zip	62f19af1d3e16ddbcd6c8926c966d862	17.3 MB
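
The MD5 digests listed for each archive can be used to check that a download arrived intact. The following is a minimal sketch, not part of the record itself: the file names and digests are copied from the listing above, while the helper names `md5_of` and `verify_downloads` and the `downloads` directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# File names and MD5 digests as listed in this record.
EXPECTED_MD5 = {
    "experiment1.zip": "81c8a726d65c1452345098cec8360540",
    "experiment2.zip": "507668f881d86fcd91e4a5cac3b55df6",
    "experiment3.zip": "14e3a977c1fc675ee9fc87b27f828157",
    "experiment4.zip": "62f19af1d3e16ddbcd6c8926c966d862",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its digest matches."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == want
    return results


if __name__ == "__main__":
    # "downloads" is a hypothetical local directory holding the archives.
    for name, ok in verify_downloads("downloads").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

A file reported as missing or mismatched should be downloaded again before it is unpacked. Reading in chunks keeps memory use low even for the largest archive.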

Additional details

Is supplement to: Publication: 10.1016/j.jvoice.2026.05.021 (DOI)

