Speech Emotion Recognition using Deep Learning

Emotions are mental states brought on by neurophysiological changes, variously associated with thoughts, feelings, behaviour, and a degree of pleasure or displeasure. In this work we implement a method for determining the underlying emotion in voice data using a CNN model and MFCC feature extraction. Samples from several databases were combined to enrich the voice data. Two-dimensional log mel-spectrograms (static, delta, and delta-delta) recovered from the voice signal were fed to the CNN as input. The segment-level features extracted by the CNN were stacked to obtain utterance-level features, and the resulting CNN model was then used in the final Speech Emotion Recognition (SER) system. The proposed method outperformed existing SER approaches.


INTRODUCTION
Emotions play a central role in people's lives. Speech emotion recognition is the study of vocal behavior with an emphasis on the nonverbal components of speech. Its fundamental premise is that there are statistically measurable aspects of the voice that can be used to gauge a speaker's emotional state. To date, vocal emotion expression has received far less attention than its facial counterpart.
Speech Emotion Recognition (SER) is seen as an important initiative in the area of Human-Computer Interaction (HCI) [1]. The main goal of this paper is to create a model for extracting emotions from speech, a task that is fraught with difficulties. Speech-based technology is steadily gaining popularity and acceptance, and advances in artificial intelligence have in turn driven progress in HCI technologies. For instance, speech signals can typically be acquired more quickly and affordably than many other biological signals (such as the ECG). Owing to their complexity and intricacy, SER systems are demanding and difficult to implement, and feature extraction is regarded as a significant problem in building them. Numerous researchers have proposed salient speech qualities, such as pitch, energy, and frequency, that convey information about the speaker's emotions. Classification, the final step of a speech emotion detection system, typically entails categorising the input audio data, divided into frames, into basic emotion classes.

There are many modern applications for identifying and understanding the emotion expressed in an audio signal. Internet of Things (IoT) applications such as Google Home, Amazon Alexa, and Mycroft operate on speech-based inputs, and voice technology plays a critical part in such applications. Self-driving or autonomous vehicles, for example, use voice commands to operate many of their functions. SER is thus a new and exciting area of study; the term "speech emotion recognition system" refers to a technology that can identify emotional characteristics in an audio input.

LITERATURE REVIEW
Shahin, A. B. Nassif and S. Hamsa [1] proposed a novel system for recognising emotion from speech. Before training and evaluating the new technique, this strategy split the emotions into six different groups. Mel-Frequency Cepstral Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), and Linear Prediction Cepstral Coefficients (LPCC) were compared in order to evaluate the overall performance of the suggested technique. The results show average and best classification accuracies of 78% and 96%, respectively, and also show that LFPC was a better choice of feature for emotion classification than the conventional features.

W. Dai, D. Han, Y. Dai, and D. Xu [2] presented a Gaussian mixture vector autoregressive (GMVAR) approach, a combination of a GMM with a vector autoregressive model, for speech emotion classification. The main idea of GMVAR is its capacity to model the multimodality of the feature distribution and to capture the interdependence among the speech features. GMVAR was evaluated on the Berlin emotional dataset. According to the experimental findings, its classification accuracy is 76%, compared to 71% for HMM, 67% for k-NN, and 55% for feed-forward neural networks. A benefit of this method over HMM is better discrimination of high- and low-arousal emotions from neutral ones.

Swain, M.; Routray, A.; Kabisatpathy, P. [3] presented novel Modulation Spectral Features (MSFs) for the recognition of human voice emotions. Suitable features are retrieved from an auditory-inspired long-term spectro-temporal representation by using an auditory filter bank for speech decomposition together with a modulation filter bank. This method uses both acoustic frequency and temporal modulation frequency components, delivering important information that is missing from traditional short-time spectral features. The classification procedure employs an SVM with a Radial Basis Function (RBF) kernel. The experiments show that the MSFs achieve acceptable overall performance in comparison with MFCC and Perceptual Linear Prediction Coefficients (PLPC), and combining MSFs with prosodic features yields a sizable improvement in recognition performance. Furthermore, an overall recognition rate of 91.6% is achieved for classification.
Singh, A.K., and Gupta [4] created a hierarchical computational structure to recognise emotions. The speech signal is mapped into one of the associated emotion classes through successive layers of binary classification. The main goal of the different levels of the tree is to keep each classification task simple in order to stop errors from propagating. The IEMOCAP dataset is used to evaluate the classification approach. In comparison to the baseline SVM, the final result improves accuracy from 72.44% to 89.58%. The findings demonstrate that this hierarchical method is effective for categorizing emotional speech across various datasets.
Basu and Aftabuddin [5] provide a reasonably thorough and comprehensive survey of speech emotion systems covering databases, features, classifiers, and emotional models. Ankur Sapra, Nikhil Panwar, and Sohan Panwar [6] issued a brief overview of the different methods available in speech emotion recognition. The research offered a review of techniques used for voice emotion recognition; its limited depth is the paper's main weakness.
Aastha Joshi [7] reviewed studies published between 2000 and 2017, covering SER systems specifically in terms of the classifiers used, feature extraction, and databases. A significant section of the study is devoted to classification; however, only traditional machine learning classifiers applied after feature extraction were taken into account.

PROBLEM DEFINITION
The primary goal is to determine a person's emotional state by processing acoustic signals. Using the Convolutional Neural Network (CNN) algorithm and MFCC feature extraction, a SER system can recognize human emotions. This paper aims to improve the SER system's performance and capabilities so that it produces reliable results with fewer false positives.

EXISTING SYSTEM
The research already done in this field shows that most existing work depends on text processing for emotion recognition, classifying emotions into three groups: angry, happy, and neutral. Usually, the main internal parameter used to determine the type of emotion is the degree of similarity between the training data and the testing data. A second method recognizes only segments of angry, happy, and neutral emotion, extracting discriminative features that are classified with a Support Vector Machine (SVM).
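For concreteness, the following is a minimal sketch of such an SVM baseline, assuming features have already been extracted into a matrix X with labels y; the placeholder data and parameter values below are purely illustrative, not taken from the cited work.

```python
# Minimal SVM baseline sketch for three-class emotion classification.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Placeholder data: 100 utterances, 40 extracted features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))
y = rng.choice(["angry", "happy", "neutral"], size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# RBF-kernel SVM, the classifier type referenced above.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```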

PROPOSED SYSTEM
In the proposed system, the Mel-Frequency Cepstral Coefficient (MFCC) feature is used to classify the data into different emotion groups. The CNN is widely used for pattern recognition, has a relatively simple structure, and uses fewer parameters for training, which makes it well suited to SER when combined with MFCC features. This technique also helps establish a suitable compromise between the real-time process's prediction accuracy and its computational cost. The Speech Emotion Recognition (SER) system was developed as a machine learning model.
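A minimal sketch of this MFCC extraction step, assuming the librosa package named in the implementation below; the file name and the choice of 40 coefficients averaged over time are illustrative assumptions.

```python
# MFCC extraction sketch: one fixed-length feature vector per utterance.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    # Load the audio; sr=None keeps the file's native sampling rate.
    y, sr = librosa.load(path, sr=None)
    # Compute the MFCC matrix of shape (n_mfcc, frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Average over time to obtain a single vector per utterance.
    return np.mean(mfcc, axis=1)

features = extract_mfcc("speech_sample.wav")  # hypothetical file name
print(features.shape)  # (40,)
```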

IMPLEMENTATION
Feature extraction is regarded as the crucial phase of the process and follows the first step. Here, a variety of machine learning tasks are carried out on the gathered datasets; these techniques address problems with data representation and data quality. The provided input is an audio sample.

In the third stage, often thought of as the core of a machine learning project, a model is constructed from the sampling rate and the Mel Frequency Cepstrum Coefficients (MFCC) computed with the librosa package. The data is then divided into training and testing sets, and a Convolutional Neural Network (CNN) model with successive layers is built to train on the dataset. The CNN model is developed using Keras, and it trains itself to respond to whatever new data it is exposed to. Among all the features extracted from the audio data, the MFCC values relate most closely to the speaker's emotional tone. Results show that using MFCC features can significantly reduce the dimensionality of the training dataset, which also lowers the model's overall computation time. Speech is represented as a three-channel image, and the CNN considers the first and second derivatives of the speech image with respect to time and frequency. The CNN is thus able to forecast, analyze voice data, learn from speech, and recognize words or utterances.

The final phase is to evaluate and assess the performance of the built model. Developers regularly repeat the cycle of creating a model and assessing it in order to compare its performance with that of other algorithms; the results of these comparisons help select the machine learning method best suited to the task.
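Below is a minimal sketch of such a Keras CNN over MFCC vectors of length 40; the layer sizes and the assumption of six emotion classes are illustrative choices, not the paper's exact architecture.

```python
# Keras CNN sketch for classifying MFCC feature vectors into emotions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Dropout,
                                     Flatten, Dense)

n_mfcc, n_classes = 40, 6  # assumed feature length and class count

model = Sequential([
    # Each utterance is one MFCC vector treated as a (40, 1) signal.
    Conv1D(64, kernel_size=5, activation="relu", input_shape=(n_mfcc, 1)),
    MaxPooling1D(pool_size=2),
    Conv1D(128, kernel_size=5, activation="relu"),
    MaxPooling1D(pool_size=2),
    Dropout(0.3),                         # regularisation against overfitting
    Flatten(),
    Dense(64, activation="relu"),
    Dense(n_classes, activation="softmax"),  # one probability per emotion
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```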

SYSTEM IMPLEMENTATION
Initially, we look at how a speech emotion model is developed and trained using the various speech emotion datasets available on the internet. The model is trained on the gathered data, and every choice and outcome the model can produce is guided by that data.
In the first step, we cleaned the original data to make it ready for creating and refining the appropriate machine learning models.
Step 2 involved pre-processing the data to remove noise. During the training process, preprocessing techniques such as trimming silent segments and removing artefacts can enhance the quality of the dataset. The pre-processed data is then characterized to address problems with the collected data and data quality; this crucial and significant step in the process is termed feature extraction. A fully convolutional network and its subsequent layers are then trained using Keras.
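A minimal sketch of the noise-removal step, assuming librosa's silence-trimming utility; the file name and the 20 dB threshold are illustrative assumptions.

```python
# Silence-trimming sketch for the pre-processing step (Step 2).
import librosa

y, sr = librosa.load("speech_sample.wav", sr=None)  # hypothetical file
# Drop leading/trailing segments quieter than 20 dB below the signal peak.
y_trimmed, _ = librosa.effects.trim(y, top_db=20)
print(len(y), "->", len(y_trimmed), "samples after trimming")
```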
We then separated the data into training data and testing data. Finally, the emotion depicted in the audio signal is predicted and displayed as output by the software. This study has demonstrated how machine learning can be used to extract the underlying sentiment from a speech audio stream and has provided some insight into how people communicate their emotions verbally. The approach is applicable in many other contexts, including call centres for customer service or marketing, voice-based chatbots and virtual assistants, linguistic research, and more.
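Putting the steps together, the following is a minimal end-to-end sketch of the split/train/predict flow described above; the random data, the integer labels, and the compact CNN are placeholders for illustration only.

```python
# End-to-end sketch: split MFCC features, train a small CNN, predict.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, GlobalAveragePooling1D, Dense

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40, 1)).astype("float32")  # 200 fake MFCC vectors
y = rng.integers(0, 6, size=200)                     # 6 emotion classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Sequential([
    Conv1D(32, 5, activation="relu", input_shape=(40, 1)),
    GlobalAveragePooling1D(),
    Dense(6, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)

# Predict the emotion class of the first held-out sample.
print("Predicted class:", model.predict(X_test[:1]).argmax(axis=1)[0])
```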