Published November 28, 2019 | Version v1
Journal article | Open access

Multimodal and multi-output deep learning architectures for the automatic assessment of voice quality using the GRB scale

  • 1. Universidad de Antioquia, Medellín, Colombia
  • 2. Universidad Politécnica de Madrid, Spain

Description

This paper addresses the automatic assessment of voice quality according to the GRB scale using a variety of deep learning architectures. The proposed architectures are multimodal, because they employ multiple sources of information, and multi-output, because they simultaneously predict all the traits of the GRB scale. A feature engineering approach is followed, based on deep neural networks and a set of well-established features such as MFCC, perturbation and complexity characteristics. Likewise, a representation learning approach is considered, using convolutional neural networks fed with modulation spectra extracted from voices. Finally, diverse loss functions are also investigated, including two surrogate losses for ordinal classification, a conventional weighted categorical cross-entropy, and a mean square error function. Experiments are carried out on a dataset containing recordings of the sustained phonation of three vowels. The best deep learning architecture provides a relative performance improvement of 6.25% for G, 14.1% for R and 18.1% for B, in comparison with recently published results using the same dataset.
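
The three families of loss functions named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The description does not say which two ordinal surrogates were used, so the cumulative-threshold encoding below is an assumed, commonly used choice. The four-grade range (0 to 3) per trait, the function names, and the unweighted sum over G, R and B are likewise assumptions made for illustration.

```python
import math

NUM_GRADES = 4  # assumption: each trait (G, R, B) is graded 0..3


def weighted_cross_entropy(probs, label, class_weights):
    """Weighted categorical cross-entropy for one sample.

    probs: predicted class probabilities (each strictly > 0).
    class_weights: one weight per grade, e.g. to offset class imbalance.
    """
    return -class_weights[label] * math.log(probs[label])


def mean_square_error(prediction, label):
    """Squared error when the grade is treated as a regression target."""
    return (prediction - label) ** 2


def ordinal_targets(label, num_grades=NUM_GRADES):
    """Encode a grade as K-1 binary targets: [label > 0, label > 1, ...]."""
    return [1.0 if label > t else 0.0 for t in range(num_grades - 1)]


def ordinal_surrogate_loss(threshold_probs, label):
    """Sum of binary cross-entropies over the K-1 threshold outputs.

    threshold_probs[t] is the predicted probability that the grade exceeds t
    (each strictly between 0 and 1).
    """
    targets = ordinal_targets(label, len(threshold_probs) + 1)
    return -sum(
        t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
        for t, p in zip(targets, threshold_probs)
    )


def multi_output_loss(outputs, labels):
    """Multi-output objective: sum the per-trait loss over G, R and B."""
    return sum(
        ordinal_surrogate_loss(outputs[trait], labels[trait])
        for trait in ("G", "R", "B")
    )
```

With this encoding, a prediction that is confident in grade 2 is penalised more for a true grade of 0 than for a true grade of 1, which is the ordering information that a plain categorical cross-entropy ignores.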

Files

JSTSP-PrePrint.pdf

367.8 kB
md5:5fcde1740b220d4f9a9f0e6aa1291598