Multimodal and multi-output deep learning architectures for the automatic assessment of voice quality using the GRB scale
- 1. Universidad de Antioquia, Medellín, Colombia
- 2. Universidad Politécnica de Madrid, Spain
Description
This paper addresses the automatic assessment of voice quality according to the GRB scale, using a variety of deep learning architectures for prediction. The proposed architectures are multimodal, since they employ multiple sources of information, and multi-output, since they simultaneously predict all the traits of the GRB scale. A feature engineering approach is followed, based on deep neural networks and a set of well-established features such as MFCCs, perturbation measures, and complexity characteristics. Likewise, a representation learning approach is considered, using convolutional neural networks fed with modulation spectra extracted from the voice recordings. Finally, diverse loss functions are investigated, including two surrogate ordinal classification losses, a conventional weighted categorical cross-entropy, and a mean squared error function. Experiments are carried out on a dataset containing recordings of the sustained phonation of three vowels. The best deep learning architecture provides a relative performance improvement of 6.25% for G, 14.1% for R, and 18.1% for B, compared with recently published results on the same dataset.
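To illustrate the general idea of a multimodal, multi-output network as described in the abstract, the following is a minimal sketch in Keras. It is not the authors' architecture: the input dimensions, layer sizes, feature vector length, modulation spectrum shape, and the use of four ordinal grades per trait are assumptions made purely for demonstration.

```python
# Illustrative sketch only: a multimodal (hand-crafted features + modulation
# spectrum CNN), multi-output (G, R, B) model. All shapes and sizes are
# assumed for the example and do not come from the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

# Branch 1: hand-crafted features (e.g., MFCC, perturbation, complexity measures)
feat_in = layers.Input(shape=(64,), name="handcrafted_features")
x1 = layers.Dense(128, activation="relu")(feat_in)
x1 = layers.Dense(64, activation="relu")(x1)

# Branch 2: modulation spectrum treated as a 2-D input processed by a CNN
mod_in = layers.Input(shape=(128, 128, 1), name="modulation_spectrum")
x2 = layers.Conv2D(16, 3, activation="relu")(mod_in)
x2 = layers.MaxPooling2D()(x2)
x2 = layers.Conv2D(32, 3, activation="relu")(x2)
x2 = layers.GlobalAveragePooling2D()(x2)

# Fusion of both modalities
fused = layers.Concatenate()([x1, x2])
fused = layers.Dense(64, activation="relu")(fused)

# Multi-output heads: one classifier per GRB trait (assuming 4 ordinal grades)
out_g = layers.Dense(4, activation="softmax", name="G")(fused)
out_r = layers.Dense(4, activation="softmax", name="R")(fused)
out_b = layers.Dense(4, activation="softmax", name="B")(fused)

model = Model(inputs=[feat_in, mod_in], outputs=[out_g, out_r, out_b])

# One loss per output head. Per-class weighting (a weighted categorical
# cross-entropy, as mentioned in the abstract) or ordinal/MSE losses could
# be substituted here trait by trait.
model.compile(
    optimizer="adam",
    loss={"G": "sparse_categorical_crossentropy",
          "R": "sparse_categorical_crossentropy",
          "B": "sparse_categorical_crossentropy"},
    metrics=["accuracy"],
)
model.summary()
```

Because Keras accepts a dictionary of losses keyed by output name, the simultaneous prediction of G, R, and B is handled by a single training loop, which is the essence of the multi-output design described above.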
Files
- JSTSP-PrePrint.pdf (367.8 kB, md5:5fcde1740b220d4f9a9f0e6aa1291598)