Published May 27, 2021 | Version v1
Presentation Open

Towards an interpretable and transferable acoustic emotion recognition for the detection of mental disorders

Creators

  • 1. Technical University of Denmark
  • 1. Child and Adolescent Mental Health Center, Copenhagen University Hospital, Capital Region
  • 2. Child and Adolescent Mental Health Center, Copenhagen University Hospital, Capital Region; Faculty of Health, Department of Clinical Medicine, Copenhagen University
  • 3. Department of Applied Mathematics and Computer Science, Technical University of Denmark

Description

Motivation
Automatic speech emotion recognition (ASER) refers to a group of algorithms that deduce the emotional state of an individual from their speech utterances. These methods are deployed in a wide range of tasks, including the detection of, and intervention in, mental disorders. State-of-the-art ASER techniques have evolved from conventional ML-based methods to the current deep neural network based solutions. Despite the long history of research contributions in this domain, state-of-the-art methods still struggle to generalize across languages and between corpora with different recording conditions, among other factors. Furthermore, most of the methods lack interpretability and transparency, both in the models themselves and in their decision-making process. These aspects are especially crucial when the methods are deployed in applications that have an impact on human lives.

Contribution
Autoencoders and studies of their latent representations are useful tools for exploring interpretable and generalizable models. We present results on the benefits of using autoencoders and their variants for ASER, predominantly for emotional states such as anger, sadness, happiness and the neutral state. We show that the clusters in the latent space are representative of the desired emotional clusters, although some emotion classes are more clearly separated than others. We go a step further and illustrate the use of DeepLIFT to gain insight into the feature subsets that contribute to the discriminative clustering of emotion classes in the latent space. Furthermore, we study the robustness of the methods by investigating how the latent representations change when the underlying data conditions are modified; in other words, how differences in the language of the corpus and in its recording conditions (acted, 'in the wild') manifest in the latent space. In addition, we explore discrete and continuous scales for their appropriateness in modelling speech emotions and their correspondence to each other.
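
The idea of checking whether latent clusters line up with emotion classes can be sketched in a few lines. The example below is only an illustration under strong assumptions: the feature vectors are synthetic (not speech data), and a linear autoencoder solved in closed form via SVD (equivalent to projecting onto the top principal directions) stands in for the deep autoencoders and variants discussed in the presentation. All function names and parameters are hypothetical.

```python
import numpy as np

EMOTIONS = ["anger", "sadness", "happiness", "neutral"]

def make_toy_features(n_per_class=50, n_features=20, seed=0):
    """Synthetic stand-in for per-utterance acoustic features (NOT real speech)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(0.0, 3.0, size=(len(EMOTIONS), n_features))
    X = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
    y = np.repeat(np.arange(len(EMOTIONS)), n_per_class)
    return X, y

def fit_linear_autoencoder(X, latent_dim=2):
    """Closed-form linear autoencoder with tied weights (top principal directions)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    W = vt[:latent_dim].T  # shape: (n_features, latent_dim)
    return mean, W

def encode(X, mean, W):
    """Map features to the latent space."""
    return (X - mean) @ W

def decode(Z, mean, W):
    """Map latent codes back to the feature space."""
    return Z @ W.T + mean

def per_class_centroid_accuracy(Z, y):
    """Per class: fraction of points whose nearest latent centroid is their own."""
    classes = np.unique(y)
    centroids = np.stack([Z[y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[dists.argmin(axis=1)]
    return {int(c): float((pred[y == c] == c).mean()) for c in classes}

X, y = make_toy_features()
mean, W = fit_linear_autoencoder(X, latent_dim=2)
Z = encode(X, mean, W)
recon_error = float(np.mean((decode(Z, mean, W) - X) ** 2))
scores = per_class_centroid_accuracy(Z, y)

print(f"reconstruction MSE: {recon_error:.3f}")
for c, acc in scores.items():
    print(f"{EMOTIONS[c]:>9}: {acc:.2f}")
```

Per-class scores of this kind are one simple way to express that some emotion classes separate more cleanly than others in the latent space. Repeating the same measurement on corpora that differ in language or recording conditions would be one way to probe robustness, again only as a rough proxy for the analysis described above.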

Files

WEMS2021_Sneha_Das.mp4 (55.1 MB)
md5:5bd32c5ca69320d6c1e97c658766d8d3