TELKOMNIKA Telecommunication Computing Electronics and Control

Received Sep 01, 2021; Revised Nov 05, 2022; Accepted Nov 15, 2022

Human-computer interaction benefits greatly from emotion recognition from speech. To promote a contact-free environment during the coronavirus disease 2019 (COVID-19) pandemic, most digital systems adopted speech-based interfaces. Consequently, emotion detection from speech has many beneficial applications, including in pathology. The vast majority of speech emotion recognition (SER) systems are built on machine learning or deep learning models and therefore demand considerable computing power. Traditional feature selection algorithms were developed to address this issue. Recent research has shown that nature-inspired or evolutionary meta-heuristic approaches, such as equilibrium optimization (EO) and cuckoo search (CS), are superior to traditional feature selection (FS) models in terms of recognition performance. The purpose of this study is to investigate the impact of meta-heuristic feature selection approaches on emotion recognition from speech. To this end, we selected the Ryerson audio-visual database of emotional speech and song (RAVDESS) and obtained a maximum recognition accuracy of 89.64% using the EO algorithm and 92.71% using the CS algorithm. As a final step, we plotted the associated precision and F1 score for each of the emotional classes.


INTRODUCTION
A variety of information sources can be used to detect emotions in people, such as speech, transcripts, facial expressions, brain signals (electroencephalography, EEG), or a combination of two or more of these (multi-modal emotion recognition). Among these, emotion recognition from speech is an essential element of human-computer interaction. Speech emotion recognition uses acoustic analysis to identify vocal changes caused by emotions and then determines which features indicate an emotion's presence [1]. However, many emotional databases contain irrelevant or redundant information, which can lower classification accuracy. This issue can be addressed by applying effective feature selection (FS) methods to speech-based applications. FS shortens the response time of the algorithm and can, in turn, improve classification accuracy. The FS process has three main phases: generation of feature subsets from the whole feature set, evaluation, and validation [2]. Many traditional feature selection algorithms have been developed for selecting relevant features for emotion classification from a speech signal, and they fall into filter and wrapper approaches. Filter approaches rank features by criteria such as information gain [3], mutual information [4], and principal component analysis [5]. In the wrapper approach, a classifier such as the K-nearest neighbour (KNN) [6] or the support vector machine (SVM) [7] is used to assess the quality of the resulting subsets. However, these traditional FS models demand considerable computation and time on speech emotional databases: in the generation phase, considering all possible subsets of the extracted features is computationally expensive.
Hence, traditional FS methods are not well suited to speech emotion recognition (SER) tasks. Researchers have therefore turned to nature-inspired optimization algorithms, known as meta-heuristic approaches, to solve this issue. Meta-heuristic algorithms are intelligent search algorithms that have already been applied to many artificial intelligence problems [8]. Recently, some researchers have adopted nature-inspired meta-heuristic algorithms to improve recognition accuracy with lower computational requirements. Well-known meta-heuristic algorithms such as the genetic algorithm (GA), ant colony optimization, cuckoo search (CS), particle swarm optimization (PSO), and grey wolf optimization (GWO) have been employed to find an optimal feature subset for speech-based emotional tasks [9]. In this paper, we address a key concern, i.e., the impact of feature selection models based on meta-heuristic approaches on SER systems. An accurate classification model requires the appropriate generation of features, the selection of features, and the use of classification methods [10]. Against this background, Figure 1 shows the role of feature selection methods in speech-based emotion recognition applications.
The key contributions of this paper are summarized as follows. First, we study the latest state-of-the-art meta-heuristic feature selection models for speech emotion recognition. Second, out of the many heuristic approaches, we analyze the impact of the equilibrium optimization (EO) and CS algorithms on SER tasks. Finally, we analyze various performance metrics on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset. The rest of the paper is organized as follows: section 2 reviews related work on SER using meta-heuristic approaches; section 3 describes the materials used in this study, such as the speech emotional database; section 4 presents the methodology for recognizing emotions from speech based on meta-heuristics; section 5 discusses the experimental results and analysis; and section 6 gives the conclusion and future perspectives of this study.

RELATED WORK
Many academic groups and research centres work on automatic speech emotion recognition and concentrate on FS algorithms to reduce computational requirements. Brester et al. [11] proposed a modified multi-objective genetic feature selection algorithm for speech emotion recognition and improved the F1-score to 86.37% and 67.70% on the Berlin emotional speech database (EmoDB) and the Surrey audio-visual expressed emotion (SAVEE) database, respectively. Unlike content-based speech recognition systems, context-independent models use only signal parameters, which classifiers treat as training and testing vectors [12]. The consistency of a feature selection algorithm is tested whenever training samples are added or removed [13], and this stability influences which features are identified as important in knowledge discovery [14]. In [15], a wrapper-based PSO feature selection model was proposed for SER tasks and achieved a recognition rate of up to 78.44% on the SAVEE database. Kozodoi et al. [16] presented a new framework for credit scoring using genetic algorithms. A cuckoo search approach was proposed in [17] and gave impressive results for SER tasks. Dey et al. [18] found that a hybrid meta-heuristic optimization FS model achieved accuracies of 97.31% and 98.45% on SAVEE and EmoDB, respectively. Zhang [20] attempted SER using a weighted binary cuckoo search algorithm and achieved an F1-score of 83.80%. Very recently, Daneshfar et al. [19] proposed a quantum-behaved particle swarm optimization (QPSO) algorithm for the dimensionality reduction of speech features and evaluated emotion recognition from speech on various datasets. Compared to state-of-the-art algorithms, this method produced more accurate results.
In all these works, researchers explored various meta-heuristic optimization algorithms for SER tasks. Studies show that no feature selection method can be said to improve or degrade SER performance in general: the effect of a feature selection method depends on the classifier, the data, and the degree of dimensionality reduction. Based on this literature analysis, we address the impact of FS methods on speech-based emotion recognition. This work applies two optimization algorithms, EO and CS, to a publicly available speech emotional database, RAVDESS.
An evolutionary optimization method for selecting features for … (Kesava Rao Bagadi) 161

MATERIALS
3.1. Speech emotional database
The selection of a database is a crucial part of speech emotion recognition, since performance depends on the naturalness of the database. In this paper, we have chosen a publicly available English-language speech emotional database, RAVDESS [21]. It contains recorded clips of both male and female speakers expressing emotions such as anger, sadness, fear, excitement, happiness and neutral. Each sample in the dataset is assigned a unique identifying name, and every sample is labelled as either normal or strong in intensity. This study extracts features that carry emotional information, selects the relevant ones for further processing, and then classifies them using an appropriate classifier.

Feature extraction
System performance and accuracy depend on signal feature extraction. The salient features of speech signals need to be extracted to identify different emotional states and speech styles. Generally, speech features are classified as acoustic features and spectral features. To analyze the speech signal, characteristics such as pitch, energy, zero-crossing rate, average Mel frequency cepstral coefficients (MFCCs), and the discrete wavelet transform are extracted. Even in SER systems based on traditional or other feature selection methods, MFCCs are among the most prominent features for recognizing emotion from speech accurately, because they characterize the properties of the voice signal. MFCCs have been found to be superior for speech recognition because the Mel scale reflects the frequency sensitivity of human perception. Here, the 12 primary discrete cosine transform (DCT) coefficients were taken as the feature vector for recognizing emotions. The process of extracting MFCCs is shown in Figure 2. In this work, we extracted MFCC features using the openSMILE toolkit [22].
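As an illustration of the last stage of the MFCC pipeline, the following minimal Python sketch applies a DCT to the log Mel filter-bank energies of one frame, keeps 12 coefficients, and averages them over frames. It is not the openSMILE implementation used in this work. The function names, the 26-band input, the unnormalized DCT-II, and the reading of "12 primary coefficients" as the first 12 (including the zeroth) are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def dct_ii(values: Sequence[float]) -> List[float]:
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_points = len(values)
    return [
        sum(x * math.cos(math.pi / n_points * (n + 0.5) * k)
            for n, x in enumerate(values))
        for k in range(n_points)
    ]


def cepstral_coefficients(log_mel_energies: Sequence[float],
                          n_coeffs: int = 12) -> List[float]:
    """Keep the first n_coeffs DCT coefficients of one frame (assumed choice)."""
    return dct_ii(log_mel_energies)[:n_coeffs]


def average_over_frames(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average each coefficient over all frames: one vector per utterance."""
    return [sum(column) / len(frames) for column in zip(*frames)]


# Hypothetical input: two frames with 26 log Mel energies each.
frames = [[0.1 * band for band in range(26)], [0.2 * band for band in range(26)]]
utterance_vector = average_over_frames([cepstral_coefficients(f) for f in frames])
```

A flat spectrum puts all of its energy into the zeroth coefficient, which is why many systems treat that coefficient separately; whether it is kept here is part of the stated assumption.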

Feature selection
Machine learning algorithms tend to over-fit when the dimension of the feature set is large, resulting in low performance. In machine learning, the objective of FS is to reduce the dimensionality of the features and the cost of classification. Unlike traditional feature selection methods, here we select the optimal feature subset for emotion recognition using meta-heuristic optimization algorithms, i.e., EO and CS.
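To make the idea of a meta-heuristic searching over feature subsets concrete, the sketch below shows one common way to score a candidate: a continuous position vector is thresholded into a binary feature mask, and the cost combines the classifier error with the fraction of features kept. The paper does not state its fitness function, so the threshold of 0.5, the weight `alpha`, and the toy error function standing in for the SVM validation error are all assumptions.

```python
from typing import Callable, List, Sequence


def to_mask(position: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Map a continuous search-agent position to a binary feature mask."""
    return [value > threshold for value in position]


def fitness(position: Sequence[float],
            error_rate: Callable[[List[bool]], float],
            alpha: float = 0.99) -> float:
    """Weighted cost (lower is better): error plus fraction of features kept."""
    mask = to_mask(position)
    n_selected = sum(mask)
    if n_selected == 0:
        return 1.0  # an empty subset cannot be classified, so give the worst cost
    ratio = n_selected / len(mask)
    return alpha * error_rate(mask) + (1.0 - alpha) * ratio


def toy_error(mask: List[bool]) -> float:
    """Hypothetical stand-in for an SVM validation error: features 0 and 2 help."""
    return 0.5 - 0.2 * mask[0] - 0.2 * mask[2]


cost = fitness([0.9, 0.1, 0.7, 0.4], toy_error)
```

In a real wrapper, `toy_error` would be replaced by the cross-validated error of the classifier trained on the masked features.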

Classifier
Classification involves training a machine-learning algorithm on a dataset and then using it to identify or classify new observations, i.e., a test dataset. In this work, we classify emotions using an SVM classifier. SVM is one of the simplest and most popular classifiers. It constructs a hyperplane that serves as an optimal boundary between different classes of data [23]. A strength of SVM is that it does not suffer from multiple local minima. Hence, in this work, we selected SVM as the classifier to recognize emotions from speech.

METHODOLOGY
Here, we study the impact of meta-heuristic optimization algorithms, namely EO and CS, on SER tasks. The framework of this FS model is shown in Figure 3. First, MFCC features are extracted from the RAVDESS dataset using the openSMILE tool. Then, following the principle of nature-inspired algorithms, the initial populations of the EO and CS algorithms are generated. EO is based on the control-volume mass balance, which is used to estimate both the dynamic and the equilibrium states.
By considering exploration and exploitation simultaneously, EO has the advantage of maintaining a good balance between them [24]. The algorithm models the dynamic mass balance of a control volume. Cuckoo search is another successful optimization method. Yang and Deb developed CS, one of the latest nature-inspired meta-heuristic algorithms, in 2010 [25]; it employs Lévy flights rather than simple isotropic random walks. According to recent studies, CS is potentially far more efficient than PSO. From a mathematical perspective, the algorithm solves n-dimensional linear and non-linear optimization problems with simple mathematics, and it has also been extended to binary optimization problems.
The general mass balance equation states that the change in mass over time equals the mass entering the system plus the mass generated inside it minus the mass leaving it. It is written as:

$$V \frac{dC}{dt} = Q C_{eq} - Q C + G \tag{1}$$

where $C$ is the concentration inside the control volume $V$, $V \frac{dC}{dt}$ is the rate of change of mass in the control volume, $Q$ is the volumetric flow rate into and out of the control volume, $C_{eq}$ is the concentration in the control volume at equilibrium without any generation, and $G$ is the mass generation rate inside the control volume. The initial population of EO is determined by the number of particles and the number of dimensions. A randomly generated initial population is represented by (2).

$$C_i^{initial} = C_{min} + rand_i \, (C_{max} - C_{min}), \quad i = 1, 2, \ldots, n \tag{2}$$

where $C_i^{initial}$ represents the initial concentration vector of the $i$-th particle, $C_{min}$ and $C_{max}$ are the minimum and maximum particle concentrations, $rand_i$ is a random vector in [0, 1], and $n$ is the population size. The equilibrium state concludes the optimization process, since it corresponds to the global optimum.
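The random initialization of the EO population, $C_i = C_{min} + rand_i (C_{max} - C_{min})$, can be sketched in a few lines of Python. The function name, the population size of 30, the 12 dimensions (one per MFCC feature), and the bounds [0, 1] are illustrative assumptions and are not values reported in this work.

```python
import random
from typing import List


def init_population(n_particles: int, n_dims: int, c_min: float, c_max: float,
                    rng: random.Random) -> List[List[float]]:
    """C_i = C_min + rand_i * (C_max - C_min), one uniform draw per dimension."""
    return [
        [c_min + rng.random() * (c_max - c_min) for _ in range(n_dims)]
        for _ in range(n_particles)
    ]


# Assumed sizes: 30 particles, 12 dimensions, concentrations bounded in [0, 1].
population = init_population(n_particles=30, n_dims=12, c_min=0.0, c_max=1.0,
                             rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the sketch reproducible; any uniform generator would serve the same purpose.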
The equilibrium state is not known at the beginning of the optimization process, so only equilibrium candidates can be determined. The equilibrium state is the final convergence state of the algorithm and is expected to be the global optimum. The candidates are the four best particles found over the whole optimization process, plus an additional particle whose concentration equals the average of these four. The choice of four candidates is based on numerous experiments on various types of case problems; as in other optimization algorithms, the number of selected particles is arbitrary. The five selected candidates are combined into a vector named the equilibrium pool, listed in (3).

$$C_{eq,pool} = \{C_{eq(1)}, C_{eq(2)}, C_{eq(3)}, C_{eq(4)}, C_{eq(ave)}\} \tag{3}$$
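A minimal sketch of building the equilibrium pool is given below. For brevity it picks the four lowest-cost particles of a single population, whereas EO tracks the four best particles found over all iterations; the function name and the toy cost function are assumptions.

```python
from typing import Callable, List, Sequence

Particle = List[float]


def build_equilibrium_pool(population: Sequence[Particle],
                           cost: Callable[[Particle], float]) -> List[Particle]:
    """Return the four lowest-cost particles followed by their element-wise mean."""
    best_four = sorted(population, key=cost)[:4]
    average = [sum(values) / len(best_four) for values in zip(*best_four)]
    return [list(particle) for particle in best_four] + [average]


# Toy population and cost (sum of coordinates), purely for illustration.
toy_population = [[float(i), float(i)] for i in range(6)]
pool = build_equilibrium_pool(toy_population, cost=sum)
```

During the search, one member of this pool is drawn at random for each particle update, which is what lets the average candidate support exploitation while the four best candidates support exploration.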
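The exponential term ($F$) contributes to the main concentration updating rule and is given in (4).

$$F = e^{-\lambda (t - t_0)} \tag{4}$$

In (5), time $t$ is defined as a function that decreases as the number of iterations ($Iter$) increases.

$$t = \left(1 - \frac{Iter}{Max\_iter}\right)^{a_2 \frac{Iter}{Max\_iter}} \tag{5}$$

Here, $a_2$ is a constant that controls the exploitation ability. As shown by (6), convergence is ensured by slowing down the search speed while improving the exploration and exploitation abilities.

$$t_0 = \frac{1}{\lambda} \ln\left(-a_1 \, \mathrm{sign}(r - 0.5)\left[1 - e^{-\lambda t}\right]\right) + t \tag{6}$$

where $\lambda$ and $r$ are random vectors in [0, 1], $a_1$ is a constant that controls the exploration ability, and $Max\_iter$ is the maximum number of iterations.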
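Substituting the expression for $t_0$ into $F = e^{-\lambda (t - t_0)}$ gives the closed form $F = a_1 \, \mathrm{sign}(r - 0.5)\,[e^{-\lambda t} - 1]$, which avoids evaluating the logarithm. The following sketch computes $t$ and this closed form for a single dimension; the function names are illustrative, and the defaults $a_1 = 2$ and $a_2 = 1$ follow the parameter values listed in Algorithm 1.

```python
import math


def eo_time(iteration: int, max_iter: int, a2: float = 1.0) -> float:
    """t = (1 - Iter / Max_iter) ** (a2 * Iter / Max_iter); decreases from 1 to 0."""
    progress = iteration / max_iter
    return (1.0 - progress) ** (a2 * progress)


def exponential_term(t: float, lam: float, r: float, a1: float = 2.0) -> float:
    """Closed form F = a1 * sign(r - 0.5) * (exp(-lam * t) - 1)."""
    sign = math.copysign(1.0, r - 0.5) if r != 0.5 else 0.0
    return a1 * sign * (math.exp(-lam * t) - 1.0)


# One scalar example: halfway through a run of 100 iterations.
t_mid = eo_time(50, 100)
f_mid = exponential_term(t_mid, lam=0.7, r=0.9)
```

The sign of $r - 0.5$ decides the direction of the move, so roughly half of the updates push a particle away from the equilibrium candidate and half pull it closer.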
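In addition, the generation rate ($G$) is a crucial term that helps provide a good exploitation phase and thus an exact solution to the optimization problem. One of many models for the generation rate is the well-known first-order exponential decay model given in (8).

$$G = G_0 e^{-k (t - t_0)} \tag{8}$$

where $G_0$ and $k$ represent the initial value and the decay constant, respectively. To produce a more symmetrical and controlled search pattern, the decay constant is set to $k = \lambda$, the same rate used in the exponential term $F = e^{-\lambda (t - t_0)}$ of (4), and the equation can be rewritten as (9).

$$G = G_0 e^{-\lambda (t - t_0)} = G_0 F \tag{9}$$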
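The generation rate and the resulting concentration update can be sketched as follows. The expressions $G_0 = GCP\,(C_{eq} - \lambda C)$, $GCP = 0.5\,r_1$ if $r_2 \ge GP$ and 0 otherwise, and $C_{new} = C_{eq} + (C - C_{eq}) F + \frac{G}{\lambda V}(1 - F)$ are taken from the original EO formulation [24] as we understand it, and they are not spelled out in the text above. The function names and the unit volume $V = 1$ are likewise assumptions.

```python
import random
from typing import List, Sequence


def generation_rate(c: Sequence[float], c_eq: Sequence[float],
                    lam: Sequence[float], f: Sequence[float],
                    gp: float, rng: random.Random) -> List[float]:
    """G = G0 * F with G0 = GCP * (C_eq - lam * C), computed element-wise."""
    r1, r2 = rng.random(), rng.random()
    gcp = 0.5 * r1 if r2 >= gp else 0.0  # generation term is switched off otherwise
    return [gcp * (ce - l * ci) * fi for ci, ce, l, fi in zip(c, c_eq, lam, f)]


def update_concentration(c: Sequence[float], c_eq: Sequence[float],
                         lam: Sequence[float], f: Sequence[float],
                         g: Sequence[float], volume: float = 1.0) -> List[float]:
    """C_new = C_eq + (C - C_eq) * F + G / (lam * V) * (1 - F), element-wise.

    Every entry of lam must be non-zero.
    """
    return [
        ce + (ci - ce) * fi + gi / (l * volume) * (1.0 - fi)
        for ci, ce, l, fi, gi in zip(c, c_eq, lam, f, g)
    ]
```

With $F = 0$ and $G = 0$ the particle jumps straight to the equilibrium candidate, and with $F = 1$ it stays where it is, which gives a quick sanity check of the update.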
Here, the generation rate control parameter (GCP) controls the generation term and represents the probability that this term contributes to the update. In conclusion, (10) represents the following EO updating rule:

$$C = C_{eq} + (C - C_{eq}) F + \frac{G}{\lambda V} (1 - F) \tag{10}$$

To perform good optimization, cuckoo search follows three basic rules:
a) each cuckoo lays one egg at a time and dumps it into a randomly chosen nest;
b) keeping the best nests and passing the best eggs down to the next generations is the top priority; and
c) the number of available host nests is fixed, and a cuckoo's egg is discovered by the host bird with a probability $p_a \in (0, 1)$.
In the last case, the host bird can either remove the egg from the nest or abandon the nest and build a new one to achieve a successful hatch. The nests are updated by random Lévy flights in the first stage of the algorithm. The pseudocode of the two feature selection algorithms is given below. Algorithm 1 gives the procedure to find the optimal feature subset for the speech emotion recognition model from the main feature set above. Another popular nature-inspired algorithm, CS, is used to find the optimal feature set and improve the recognition performance; its pseudocode is described in Algorithm 2.

Algorithm 1 [24]
Input: initial population and feature space
Output: the final combination of features (best option)
1 Initialize the particle population $C_i$, $i = 1, 2, 3, \ldots, n$
2 Assign each equilibrium candidate a large fitness value
3 Assign the free parameters $a_1 = 2$, $a_2 = 1$, $GP = 0.5$
4 While ($Iter < Max\_iter$) do // read all the sub folders in dataset in main folder
5   For $i = 1, \ldots, n$ ($n$ is the number of particles) do
6     Determine the fitness of the $i$-th particle
7     If $fit(C_i) < fit(C_{eq(1)})$ then
8       Replace $C_{eq(1)}$ with $C_i$ and $fit(C_{eq(1)})$ with $fit(C_i)$
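The Lévy-flight nest update and the abandonment of discovered nests in cuckoo search can be sketched as below. The step is drawn with Mantegna's algorithm, a common way to sample Lévy-distributed steps. The exponent $\beta = 1.5$, the step size of 0.01, the bounds [0, 1], the function names, and the simple random re-initialization of abandoned nests are assumptions and are not the settings used in this work.

```python
import math
import random
from typing import List, Sequence


def mantegna_sigma(beta: float = 1.5) -> float:
    """Scale of the numerator Gaussian in Mantegna's Levy-step algorithm."""
    numerator = math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
    denominator = (math.gamma((1.0 + beta) / 2.0) * beta
                   * 2.0 ** ((beta - 1.0) / 2.0))
    return (numerator / denominator) ** (1.0 / beta)


def levy_step(rng: random.Random, beta: float = 1.5) -> float:
    """Draw one heavy-tailed step: u / |v| ** (1 / beta)."""
    u = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1.0 / beta)


def new_nest(nest: Sequence[float], best: Sequence[float], rng: random.Random,
             step_size: float = 0.01, lower: float = 0.0,
             upper: float = 1.0) -> List[float]:
    """Move a nest by a Levy flight scaled by its distance to the best nest."""
    candidate = [x + step_size * levy_step(rng) * (x - b)
                 for x, b in zip(nest, best)]
    return [min(max(x, lower), upper) for x in candidate]  # clip to the bounds


def abandon(nests: Sequence[Sequence[float]], p_a: float, rng: random.Random,
            lower: float = 0.0, upper: float = 1.0) -> List[List[float]]:
    """Replace each nest with a fresh random one with probability p_a."""
    return [
        [lower + rng.random() * (upper - lower) for _ in nest]
        if rng.random() < p_a else list(nest)
        for nest in nests
    ]
```

For feature selection, each nest would be thresholded into a binary feature mask before its fitness is evaluated, and a candidate from `new_nest` would replace the old nest only if its fitness is better.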

EXPERIMENTAL RESULTS AND DISCUSSIONS
We relied on four prominent evaluation metrics: accuracy, F1 score, recall, and precision. These metrics are derived from the elementary measures contained in the confusion matrix, namely the true positive, true negative, false positive, and false negative counts. The two evolutionary algorithms above were implemented in Python with the openSMILE and librosa toolkits. The RAVDESS dataset used here contains a total of 400 speech samples from both male and female speakers expressing different emotions in English. Applying feature extraction to these samples yields MFCCs, and these features are given to the SVM classifier to identify the emotion and estimate the accuracy of the model. In this work, to reduce the burden on the classifier, the number of features is reduced using popular meta-heuristic approaches, namely EO and CS. The EO and CS algorithms were then applied individually, and the results are reported below.
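The way these metrics follow from the confusion matrix can be sketched in plain Python. The convention that rows hold the true class and columns the predicted class, the function names, and the two-class toy matrix are assumptions for illustration; the values below are not results from this study.

```python
from typing import Dict, List, Sequence


def per_class_metrics(confusion: Sequence[Sequence[int]]) -> List[Dict[str, float]]:
    """Precision, recall and F1 per class; rows are true, columns predicted."""
    n_classes = len(confusion)
    results = []
    for k in range(n_classes):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n_classes)) - tp
        fn = sum(confusion[k]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2.0 * precision * recall / denominator if denominator else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results


def accuracy(confusion: Sequence[Sequence[int]]) -> float:
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[k][k] for k in range(len(confusion))) / total


# Toy two-class matrix: 8 and 9 correct predictions, 3 errors in total.
toy_confusion = [[8, 2], [1, 9]]
metrics = per_class_metrics(toy_confusion)
overall = accuracy(toy_confusion)
```

In the multi-class case the true negatives of a class are all samples outside its row and column, so they enter accuracy but not precision, recall, or F1.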
To evaluate the impact of the meta-heuristic approaches, we use precision and F1-score, which are popular metrics in the analysis of speech emotion classification. Table 1 and Table 2 present the precision and F1-score of the EO- and CS-based FS models for SER tasks, and the corresponding graphical representations are shown in Figure 4 and Figure 5. From these results, we can say that meta-heuristic-based FS models outperform the traditional feature selection methods discussed in section 2.
Finally, the EO- and CS-based FS models achieve speech emotion recognition accuracies of 89.64% and 92.71%, respectively. Hence, meta-heuristic optimization can achieve impressive recognition rates on classification problems such as speech emotion recognition. Table 3 compares state-of-the-art methods with our attempt using the meta-heuristic approach. To determine the impact of the feature selection methods, the success rate obtained without any selection method is used as the reference value. From the above analysis, it is observed that, compared to traditional feature selection algorithms, the meta-heuristic approach yields better accuracy for speech emotion recognition.

CONCLUSION
The main goal of this work is to achieve an impressive recognition rate with a smaller feature set. In real-time settings, the success rate decreases due to the high-dimensional feature set. To address this problem, we applied FS models for SER using meta-heuristic approaches, namely the EO and CS algorithms. In our experiments, we used normalization to reduce the feature set and increase the precision and F1-score. However, during this work, we faced some challenges related to the number of iterations needed to achieve the best fitness. Another challenge is that feature engineering was performed manually rather than automatically. Hence, there is room for applying these meta-heuristic approaches with automatic feature engineering, such as deep learning.