Deep learning techniques for speech emotion recognition: A review
- 1. Management & Science University, Malaysia
- 2. State Polytechnic of Jember, Indonesia
Description
Speech emotion recognition is gaining significant importance in the domains of pattern recognition and natural language processing. In recent years, there has been notable progress in speech emotion detection, primarily attributed to the successful application of deep learning techniques. However, much of the research in this area lacks a thorough comparative study of the different deep learning models and techniques used for speech emotion detection, which makes it difficult to identify the best-performing approaches and their relative strengths and weaknesses. The purpose of this work is therefore to provide a comprehensive and detailed overview of deep learning methods for speech emotion detection. The method used is a comparative analysis of previous articles relevant to the topic, covering both the deep learning methods and the datasets they employ. The datasets analyzed include EMO-DB, RAVDESS, TESS, CREMA-D, IEMOCAP, and the Danish Emotional Speech Database. The language used in these datasets is English, except for EMO-DB, which is in German, and the Danish Emotional Speech Database, which is in Danish. Most of the emotion types extracted from these datasets are basic emotions such as happiness, sadness, neutrality, disgust, surprise, and anger. The results of this review show that the application of deep learning techniques has led to significant progress in speech emotion detection. Complex deep learning models, for instance CNN-RNN combinations, can extract relevant acoustic features and produce accurate results in recognizing emotion from speech; a minimal code sketch of this pattern follows below. This advancement has significant implications for various applications, including human-computer interaction, affective computing, call center analytics, psychological research, and clinical diagnosis.
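To make the CNN-RNN pattern mentioned above concrete, the sketch below shows one common way such a model is assembled: a small CNN extracts local acoustic features from a log-mel spectrogram, an LSTM models their temporal evolution, and a linear layer predicts one of the basic emotion classes. This is a minimal illustration only, not the architecture of any specific reviewed paper; the class name `CnnLstmSER`, all layer sizes, and the 6-class label set are illustrative assumptions.

```python
# Minimal CNN-LSTM sketch for speech emotion recognition (illustrative only).
import torch
import torch.nn as nn

class CnnLstmSER(nn.Module):
    def __init__(self, n_mels: int = 64, n_emotions: int = 6):
        super().__init__()
        # 2-D convolutions over (time, mel) treat the spectrogram like an image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # The LSTM consumes one flattened feature vector per (downsampled) time step.
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=64,
                            batch_first=True)
        self.classifier = nn.Linear(64, n_emotions)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, time, n_mels) log-mel spectrogram
        x = self.cnn(spec)                    # (batch, 32, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, time/4, 32 * n_mels/4)
        _, (h, _) = self.lstm(x)              # final hidden state summarizes the utterance
        return self.classifier(h[-1])         # (batch, n_emotions) class logits

# Example: a batch of two ~3-second utterances at ~100 frames/s with 64 mel bands.
logits = CnnLstmSER()(torch.randn(2, 1, 300, 64))
print(logits.shape)  # torch.Size([2, 6])
```

In practice, the reviewed systems differ mainly in how the feature extractor and the recurrent layer are configured (e.g., attention over LSTM states, multi-task heads), but the division of labor sketched here — convolutional feature extraction followed by temporal modeling — is the common core.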
Files
- IRJSTEM_V3N2_2023_P07.pdf (479.4 kB, md5:2f95cb58e5727e1ce40d561857cea915)
Additional details
References
- Abbaschian, B.J., Sierra-Sosa, D., & Elmaghraby, A. (2021). Deep learning techniques for speech emotion recognition, from databases to models. Sensors (Switzerland), 21(4), 1–27. https://doi.org/10.3390/s21041249
- Akçay, M.B. & Oğuz, K. (2020). Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116(June 2019), 56–76. https://doi.org/10.1016/j.specom.2019.12.001
- Anvarjon, T., Mustaqeem, & Kwon, S. (2020). Deep-Net: A lightweight CNN-based speech emotion recognition system using deep frequency features. Sensors (Switzerland), 20(18), 1–16. https://doi.org/10.3390/s20185212
- Aouani, H. & Ayed, Y.B. (2020). Speech Emotion Recognition with deep learning. Procedia Computer Science, 176, 251–260. https://doi.org/10.1016/j.procs.2020.08.027
- Balomenos, T., Raouzaiou, A., Ioannou, S., Drosopoulos, A., Karpouzis, K., & Kollias, S. (2005). Emotion analysis in man-machine interaction systems. Lecture Notes in Computer Science, 3361, 318–328. https://doi.org/10.1007/978-3-540-30568-2_27
- Bertero, D. & Fung, P. (2017). A first look into a convolutional neural network for speech emotion detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 5115–5119). http://ieeexplore.ieee.org/document/7953131/
- Byun, S.W. & Lee, S.P. (2021). A study on a speech emotion recognition system with effective acoustic features using deep learning algorithms. Applied Sciences (Switzerland), 11(4), 1–15. https://doi.org/10.3390/app11041890
- Chatziagapi, A., Paraskevopoulos, G., Sgouropoulos, D., Pantazopoulos, G., Nikandrou, M., Giannakopoulos, T., Katsamanis, A., Potamianos, A., & Narayanan, S. (2019). Data augmentation using GANs for speech emotion recognition. Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), 171–175. https://doi.org/10.21437/Interspeech.2019-2561
- Chen, M., Zhou, P., & Fortino, G. (2017). Emotion Communication System. IEEE Access, 5, 326–337. https://doi.org/10.1109/ACCESS.2016.2641480
- Douglas-Cowie, E., Cowie, R., & Schröder, M. (2000). A new emotion database: Considerations, sources and scope. In Proceedings of the ISCA Workshop on Speech and Emotion (pp. 39–44).
- Eskimez, S. E., Duan, Z., & Heinzelman, W. (2018). Unsupervised learning approach to feature analysis for automatic speech emotion recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5099-5103). http://ieeexplore.ieee.org/document/8462417/
- Ho, N. H., Yang, H. J., Kim, S. H., & Lee, G. (2020). Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network. IEEE Access, 8, 61672–61686. https://doi.org/10.1109/ACCESS.2020.2984368
- Hubel, D. H. & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1), 215–243. https://doi.org/10.1113/jphysiol.1968.sp008455
- Latif, S., Rana, R., & Qadir, J. (2018). Adversarial Machine Learning And Speech Emotion Recognition: Utilizing Generative Adversarial Networks For Robustness. 1–7. http://arxiv.org/abs/1811.11402
- Latif, S., Rana, R., Qadir, J., & Epps, J. (2017). Variational Autoencoders for Learning Latent Representations of Speech Emotion: A Preliminary Study. http://arxiv.org/abs/1712.08708
- Lee, J. & Tashev, I. (2015). High-level feature representation using recurrent neural network for speech emotion recognition. Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), 1537–1540. https://doi.org/10.21437/interspeech.2015-336
- Lieskovská, E., Jakubec, M., Jarina, R., & Chmulík, M. (2021). A review on speech emotion recognition using deep learning and attention mechanism. Electronics (Switzerland), 10(10). https://doi.org/10.3390/electronics10101163
- Nassif, A. B., Shahin, I., Attili, I., Azzeh, M., & Shaalan, K. (2019). Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access, 7, 19143–19165. https://doi.org/10.1109/ACCESS.2019.2896880
- Pandey, S.K., Shekhawat, H.S., & Prasanna, S.R.M. (2019). Deep learning techniques for speech emotion recognition: A review. 2019 29th International Conference Radioelektronika, RADIOELEKTRONIKA 2019 - Microwave and Radio Electronics Week, MAREW 2019. https://doi.org/10.1109/RADIOELEK.2019.8733432
- Petrushin, V.A. (2000). Emotion recognition in speech signal: Experimental study, development, and application. 6th International Conference on Spoken Language Processing, ICSLP 2000, Icslp, 6–9. https://doi.org/10.21437/icslp.2000-791
- Reynolds, D.A. (2002). An overview of automatic speaker recognition technology. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
- Sahu, S., Gupta, R., & Espy-Wilson, C. (2018). On Enhancing Speech Emotion Recognition using Generative Adversarial Networks. http://arxiv.org/abs/1806.06626
- Scheidwasser-Clow, N., Kegler, M., Beckmann, P., & Cernak, M. (2022). SERAB: A multi-lingual benchmark for speech emotion recognition. ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing, 7697–7701. https://doi.org/10.1109/ICASSP43922.2022.9747348
- Schmidhuber, J. (2015). Deep Learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003
- Swain, M., Routray, A., & Kabisatpathy, P. (2018). Databases, features and classifiers for speech emotion recognition: a review. International Journal of Speech Technology, 21(1), 93–120. https://doi.org/10.1007/s10772-018-9491-z
- Tarunika, K., Pradeeba, R.B., & Aruna, P. (2018, October 16). Applying Machine Learning Techniques for Speech Emotion Recognition. 2018 9th International Conference on Computing, Communication and Networking Technologies, ICCCNT 2018. https://doi.org/10.1109/ICCCNT.2018.8494104
- Vogt, T., André, E., & Wagner, J. (2008). Automatic recognition of emotions from speech: A review of the literature and recommendations for practical realisation. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 4868 LNCS, 75–91. https://doi.org/10.1007/978-3-540-85099-1_7
- Xie, Y., Liang, R., Liang, Z., Huang, C., Zou, C., & Schuller, B. (2019). Speech Emotion Classification Using Attention-Based LSTM. IEEE/ACM Transactions on Audio Speech and Language Processing, 27(11), 1675–1685. https://doi.org/10.1109/TASLP.2019.2925934
- Yao, Z., Wang, Z., Liu, W., Liu, Y., & Pan, J. (2020). Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN. Speech Communication, 120, 11–19. https://doi.org/10.1016/j.specom.2020.03.005
- Zhang, C. & Xue, L. (2021). Autoencoder with emotion embedding for speech emotion recognition. IEEE Access, 9, 51231–51241. https://doi.org/10.1109/ACCESS.2021.3069818
- Zhao, J., Mao, X., & Chen, L. (2019). Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomedical Signal Processing and Control, 47, 312–323. https://doi.org/10.1016/j.bspc.2018.08.035