Deep learning techniques for speech emotion recognition: A review
- 1. Management & Science University, Malaysia
- 2. State Polytechnic of Jember, Indonesia
Description
Speech emotion recognition is gaining significant importance in the domains of pattern recognition and natural language processing. In recent years, there has been notable progress in speech emotion detection, primarily attributed to the successful application of deep learning techniques. However, much of the research in this area lacks a thorough comparative study of the different deep learning models and techniques used for speech emotion detection, which makes it difficult to identify the best-performing approaches and their relative strengths and weaknesses. The purpose of this work is therefore to provide a comprehensive and detailed overview of deep learning methods for speech emotion detection. The method used is a comparative analysis of previous articles relevant to the topic, covering both the deep learning methods and the datasets they employ. The datasets analyzed include EMO-DB, RAVDESS, TESS, CREMA-D, IEMOCAP, and the Danish Emotional Speech Database. The language used in these datasets is English, except for EMO-DB, which is in German, and the Danish Emotional Speech Database, which is in Danish. Most of the emotion types extracted from these datasets are basic emotions such as happiness, sadness, neutrality, disgust, surprise, and anger. The results of this review show that the application of deep learning techniques has led to significant progress in speech emotion detection. Complex deep learning models, for instance CNN-RNN combinations, can extract relevant acoustic features and produce accurate results in recognizing emotion from speech; a minimal code sketch of this pattern follows below. This advancement has significant implications for various applications, including human-computer interaction, affective computing, call center analytics, psychological research, and clinical diagnosis.
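To make the CNN-RNN pattern mentioned above concrete, the sketch below shows one common way such a model is assembled: a small CNN extracts local acoustic features from a log-mel spectrogram, an LSTM models their temporal evolution, and a linear layer predicts one of the basic emotion classes. This is a minimal illustration only, not the architecture of any specific reviewed paper; the class name `CnnLstmSER`, all layer sizes, and the 6-class label set are illustrative assumptions.

```python
# Minimal CNN-LSTM sketch for speech emotion recognition (illustrative only).
import torch
import torch.nn as nn

class CnnLstmSER(nn.Module):
    def __init__(self, n_mels: int = 64, n_emotions: int = 6):
        super().__init__()
        # 2-D convolutions over (time, mel) treat the spectrogram like an image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # The LSTM consumes one flattened feature vector per (downsampled) time step.
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=64,
                            batch_first=True)
        self.classifier = nn.Linear(64, n_emotions)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, time, n_mels) log-mel spectrogram
        x = self.cnn(spec)                    # (batch, 32, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, time/4, 32 * n_mels/4)
        _, (h, _) = self.lstm(x)              # final hidden state summarizes the utterance
        return self.classifier(h[-1])         # (batch, n_emotions) class logits

# Example: a batch of two ~3-second utterances at ~100 frames/s with 64 mel bands.
logits = CnnLstmSER()(torch.randn(2, 1, 300, 64))
print(logits.shape)  # torch.Size([2, 6])
```

In practice, the reviewed systems differ mainly in how the feature extractor and the recurrent layer are configured (e.g., attention over LSTM states, multi-task heads), but the division of labor sketched here — convolutional feature extraction followed by temporal modeling — is the common core.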
Files
- IRJSTEM_V3N2_2023_P07.pdf (479.4 kB, md5:2f95cb58e5727e1ce40d561857cea915)
Additional details
References
- Abbaschian, B.J., Sierra-Sosa, D., & Elmaghraby, A. (2021). Deep learning techniques for speech emotion recognition, from databases to models. Sensors (Switzerland), 21(4), 1–27. https://doi.org/10.3390/s21041249
- Akçay, M.B. & Oğuz, K. (2020). Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116(June 2019), 56–76. https://doi.org/10.1016/j.specom.2019.12.001
- Anvarjon, T., Mustaqeem, & Kwon, S. (2020). Deep-Net: A lightweight CNN-based speech emotion recognition system using deep frequency features. Sensors (Switzerland), 20(18), 1–16. https://doi.org/10.3390/s20185212
- Aouani, H. & Ayed, Y.B. (2020). Speech Emotion Recognition with deep learning. Procedia Computer Science, 176, 251–260. https://doi.org/10.1016/j.procs.2020.08.027
- Balomenos, T., Raouzaiou, A., Ioannou, S., Drosopoulos, A., Karpouzis, K., & Kollias, S. (2005). Emotion analysis in man-machine interaction systems. Lecture Notes in Computer Science, 3361, 318–328. https://doi.org/10.1007/978-3-540-30568-2_27
- Bertero, D. & Fung, P. (2017). A first look into a convolutional neural network for speech emotion detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 5115–5119). http://ieeexplore.ieee.org/document/7953131/
- Byun, S.W. & Lee, S.P. (2021). A study on a speech emotion recognition system with effective acoustic features using deep learning algorithms. Applied Sciences (Switzerland), 11(4), 1–15. https://doi.org/10.3390/app11041890
- Chatziagapi, A., Paraskevopoulos, G., Sgouropoulos, D., Pantazopoulos, G., Nikandrou, M., Giannakopoulos, T., Katsamanis, A., Potamianos, A., & Narayanan, S. (2019). Data augmentation using GANs for speech emotion recognition. Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), 171–175. https://doi.org/10.21437/Interspeech.2019-2561
- Chen, M., Zhou, P., & Fortino, G. (2017). Emotion Communication System. IEEE Access, 5, 326–337. https://doi.org/10.1109/ACCESS.2016.2641480
- Douglas-Cowie, E., Cowie, R., & Schröder, M. (2000). A new emotion database: Considerations, sources and scope. In Proceedings of the ISCA Workshop on Speech and Emotion (pp. 39–44).
- Eskimez, S. E., Duan, Z., & Heinzelman, W. (2018). Unsupervised learning approach to feature analysis for automatic speech emotion recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5099-5103). http://ieeexplore.ieee.org/document/8462417/
- Ho, N. H., Yang, H. J., Kim, S. H., & Lee, G. (2020). Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network. IEEE Access, 8, 61672–61686. https://doi.org/10.1109/ACCESS.2020.2984368
- Hubel, D. H. & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1), 215–243. https://doi.org/10.1113/jphysiol.1968.sp008455
- Latif, S., Rana, R., & Qadir, J. (2018). Adversarial Machine Learning And Speech Emotion Recognition: Utilizing Generative Adversarial Networks For Robustness. 1–7. http://arxiv.org/abs/1811.11402
- Latif, S., Rana, R., Qadir, J., & Epps, J. (2017). Variational Autoencoders for Learning Latent Representations of Speech Emotion: A Preliminary Study. http://arxiv.org/abs/1712.08708
- Lee, J. & Tashev, I. (2015). High-level feature representation using recurrent neural network for speech emotion recognition. Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), 1537–1540. https://doi.org/10.21437/interspeech.2015-336
- Lieskovská, E., Jakubec, M., Jarina, R., & Chmulík, M. (2021). A review on speech emotion recognition using deep learning and attention mechanism. Electronics (Switzerland), 10(10). https://doi.org/10.3390/electronics10101163
- Nassif, A. B., Shahin, I., Attili, I., Azzeh, M., & Shaalan, K. (2019). Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access, 7, 19143–19165. https://doi.org/10.1109/ACCESS.2019.2896880
- Pandey, S.K., Shekhawat, H.S., & Prasanna, S.R.M. (2019). Deep learning techniques for speech emotion recognition: A review. 2019 29th International Conference Radioelektronika, RADIOELEKTRONIKA 2019 - Microwave and Radio Electronics Week, MAREW 2019. https://doi.org/10.1109/RADIOELEK.2019.8733432
- Petrushin, V.A. (2000). Emotion recognition in speech signal: Experimental study, development, and application. 6th International Conference on Spoken Language Processing, ICSLP 2000, Icslp, 6–9. https://doi.org/10.21437/icslp.2000-791
- Reynolds, D.A. (2002). An overview of automatic speaker recognition technology. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
- Sahu, S., Gupta, R., & Espy-Wilson, C. (2018). On Enhancing Speech Emotion Recognition using Generative Adversarial Networks. http://arxiv.org/abs/1806.06626
- Scheidwasser-Clow, N., Kegler, M., Beckmann, P., & Cernak, M. (2022). SERAB: A multi-lingual benchmark for speech emotion recognition. ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing, 7697–7701. https://doi.org/10.1109/ICASSP43922.2022.9747348
- Schmidhuber, J. (2015). Deep Learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003
- Swain, M., Routray, A., & Kabisatpathy, P. (2018). Databases, features and classifiers for speech emotion recognition: a review. International Journal of Speech Technology, 21(1), 93–120. https://doi.org/10.1007/s10772-018-9491-z
- Tarunika, K., Pradeeba, R.B., & Aruna, P. (2018, October 16). Applying Machine Learning Techniques for Speech Emotion Recognition. 2018 9th International Conference on Computing, Communication and Networking Technologies, ICCCNT 2018. https://doi.org/10.1109/ICCCNT.2018.8494104
- Vogt, T., André, E., & Wagner, J. (2008). Automatic recognition of emotions from speech: A review of the literature and recommendations for practical realisation. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 4868 LNCS, 75–91. https://doi.org/10.1007/978-3-540-85099-1_7
- Xie, Y., Liang, R., Liang, Z., Huang, C., Zou, C., & Schuller, B. (2019). Speech Emotion Classification Using Attention-Based LSTM. IEEE/ACM Transactions on Audio Speech and Language Processing, 27(11), 1675–1685. https://doi.org/10.1109/TASLP.2019.2925934
- Yao, Z., Wang, Z., Liu, W., Liu, Y., & Pan, J. (2020). Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN. Speech Communication, 120, 11–19. https://doi.org/10.1016/j.specom.2020.03.005
- Zhang, C. & Xue, L. (2021). Autoencoder with emotion embedding for speech emotion recognition. IEEE Access, 9, 51231–51241. https://doi.org/10.1109/ACCESS.2021.3069818
- Zhao, J., Mao, X., & Chen, L. (2019). Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomedical Signal Processing and Control, 47, 312–323. https://doi.org/10.1016/j.bspc.2018.08.035