SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis

doi:10.5281/zenodo.7119400

Published September 30, 2022 | Version v1

Dataset Open

1. Innoetics, Samsung Electronics, Greece
2. Mobile Communications Business, Samsung Electronics, Republic of Korea

This is the public release of the Samsung Open Mean Opinion Scores (SOMOS) dataset for the evaluation of neural text-to-speech (TTS) synthesis. It consists of audio files generated with a public-domain voice by trained TTS models drawn from the literature, together with the quality (naturalness) scores assigned to each audio file by several crowdsourced listeners.

Description

The SOMOS dataset contains 20,000 synthetic utterances (wavs), 100 natural utterances and 374,955 naturalness evaluations (human-assigned scores in the range 1-5). The synthetic utterances are single-speaker, generated by training several Tacotron-like acoustic models and an LPCNet vocoder on the public LJ Speech dataset. The 2,000 synthesized text sentences were selected from Blizzard Challenge texts of the years 2007-2016, the LJ Speech corpus, Wikipedia, and general-domain data from the Internet.
Naturalness evaluations were collected by crowdsourcing a listening test on Amazon Mechanical Turk in the US, GB and CA locales. The records of listening test participants (workers) are fully anonymized. Statistics on the reliability of the scores assigned by the workers are also included; they were generated by processing the scores and validation controls of each submission page.
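
As a rough illustration of how a mean opinion score is derived from individual 1-5 ratings, the sketch below averages the scores per utterance. The record layout (pairs of utterance ID and score) and the sample IDs are hypothetical and are not taken from the files in somos.zip.

```python
from collections import defaultdict
from statistics import mean


def mean_opinion_scores(ratings):
    """Average the 1-5 naturalness scores of each utterance.

    `ratings` is an iterable of (utterance_id, score) pairs. This layout is
    an assumption for illustration, not the dataset's actual schema.
    """
    by_utterance = defaultdict(list)
    for utterance_id, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score outside the 1-5 range: {score}")
        by_utterance[utterance_id].append(score)
    return {utt: mean(scores) for utt, scores in by_utterance.items()}


# Hypothetical ratings from several listeners for two utterances.
ratings = [("utt_a", 4), ("utt_a", 3), ("utt_a", 5), ("utt_b", 2), ("utt_b", 3)]
mos = mean_opinion_scores(ratings)
```

The same grouping could be done per TTS system instead of per utterance to obtain system-level scores.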

To listen to audio samples of the dataset, please see our GitHub page.

The dataset release comes with a carefully designed train-validation-test split (70%-15%-15%) with unseen systems, listeners and texts, which can be used for experimentation on MOS prediction.
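
When experimenting with the split, it can be useful to check which systems, listeners and texts of a held-out partition never occur in training. The sketch below does this for in-memory records; the keys "system", "listener" and "text" and all sample values are assumed names for illustration, not the dataset's actual column names.

```python
def unseen_values(train_records, heldout_records, key):
    """Return the values of `key` found in the held-out split but not in training."""
    seen = {record[key] for record in train_records}
    return {record[key] for record in heldout_records} - seen


# Hypothetical records; the field names are assumptions.
train = [
    {"system": "sys_01", "listener": "w1", "text": "t1"},
    {"system": "sys_02", "listener": "w2", "text": "t2"},
]
test = [
    {"system": "sys_02", "listener": "w3", "text": "t3"},
    {"system": "sys_03", "listener": "w1", "text": "t4"},
]

report = {
    key: unseen_values(train, test, key)
    for key in ("system", "listener", "text")
}
```

A non-empty set for a key means the held-out partition exercises generalization to unseen items of that kind.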

Terms of use

The dataset may be used for research purposes only, for non-commercial purposes only, and may be distributed with the same terms.
Every time you produce research that has used this dataset, please cite the dataset appropriately.

Cite as:

@inproceedings{maniati22_interspeech,
  author={Georgia Maniati and Alexandra Vioni and Nikolaos Ellinas and Karolos Nikitaras and Konstantinos Klapsas and June Sig Sung and Gunu Jho and Aimilios Chalamandaris and Pirros Tsiakoulis},
  title={{SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={2388--2392},
  doi={10.21437/Interspeech.2022-10922}
}

References of resources & models used

Voice & synthesized texts:
K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.

Vocoder:
J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proc. ICASSP, 2019.
R. Vipperla, S. Park, K. Choo, S. Ishtiaq, K. Min, S. Bhattacharya, A. Mehrotra, A. G. C. P. Ramos, and N. D. Lane, “Bunched LPCNet: Vocoder for low-cost neural text-to-speech systems,” in Proc. Interspeech, 2020.

Acoustic models:
N. Ellinas, G. Vamvoukakis, K. Markopoulos, A. Chalamandaris, G. Maniati, P. Kakoulidis, S. Raptis, J. S. Sung, H. Park, and P. Tsiakoulis, “High quality streaming speech synthesis with low, sentence-length-independent latency,” in Proc. Interspeech, 2020.
Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards End-to-End Speech Synthesis,” in Proc. Interspeech, 2017.
J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” in Proc. ICASSP, 2018.
J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, “Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling,” arXiv preprint arXiv:2010.04301, 2020.
M. Honnibal and M. Johnson, “An Improved Non-monotonic Transition System for Dependency Parsing,” in Proc. EMNLP, 2015.
M. Dominguez, P. L. Rohrer, and J. Soler-Company, “PyToBI: A Toolkit for ToBI Labeling Under Python,” in Proc. Interspeech, 2019.
Y. Zou, S. Liu, X. Yin, H. Lin, C. Wang, H. Zhang, and Z. Ma, “Fine-grained prosody modeling in neural speech synthesis using ToBI representation,” in Proc. Interspeech, 2021.
K. Klapsas, N. Ellinas, J. S. Sung, H. Park, and S. Raptis, “Word-Level Style Control for Expressive, Non-attentive Speech Synthesis,” in Proc. SPECOM, 2021.
T. Raitio, R. Rasipuram, and D. Castellani, “Controllable neural text-to-speech synthesis using intuitive prosodic features,” in Proc. Interspeech, 2020.

Synthesized texts from the Blizzard Challenges 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2016:
M. Fraser and S. King, “The Blizzard Challenge 2007,” in Proc. SSW6, 2007.
V. Karaiskos, S. King, R. A. Clark, and C. Mayo, “The Blizzard Challenge 2008,” in Proc. Blizzard Challenge Workshop, 2008.
A. W. Black, S. King, and K. Tokuda, “The Blizzard Challenge 2009,” in Proc. Blizzard Challenge, 2009.
S. King and V. Karaiskos, “The Blizzard Challenge 2010,” 2010.
S. King and V. Karaiskos, “The Blizzard Challenge 2011,” 2011.
S. King and V. Karaiskos, “The Blizzard Challenge 2012,” 2012.
S. King and V. Karaiskos, “The Blizzard Challenge 2013,” 2013.
S. King and V. Karaiskos, “The Blizzard Challenge 2016,” 2016.

Contact

Georgia Maniati - g.maniati@samsung.com

If you have any questions or comments about the dataset, please feel free to write to us.
We are interested in knowing if you find our dataset useful! If you use our dataset, please email us and tell us about your research.

Files

Name	MD5	Size
somos.zip	7728fa20cfe978c56370efac8c7ffffc	4.0 GB
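
The MD5 checksum listed for somos.zip can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard hashlib module; the function names and the chunk size are illustrative choices, and the local path of the archive is whatever the file was saved as.

```python
import hashlib

# Checksum listed for somos.zip on this page.
EXPECTED_MD5 = "7728fa20cfe978c56370efac8c7ffffc"


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return file_md5(path) == expected.lower()
```

Calling verify_download("somos.zip") on the downloaded archive should return True; a False result suggests a truncated or corrupted download. Reading in chunks keeps memory use low for the 4.0 GB file.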

Additional details

Is described by: Conference paper: 10.21437/Interspeech.2022-10922 (DOI)
