A machine learning approach to predict emotional arousal and valence from gaze extracted features

In recent years, many studies have investigated emotional arousal and valence. Most have focused on physiological signals such as EEG or EMG, cardiovascular measures, or skin conductance. However, eye-related features, especially pupil size and blink activity, have proven to be helpful and easy-to-use metrics. The aim of this study is to predict the levels of emotional arousal and valence induced during emotionally charged situations from eye-related features. To this end, we performed an experimental study in which participants watched emotion-eliciting videos and self-assessed their emotions while their eye movements were recorded. Several classifiers, including kNN, SVM, Naive Bayes, decision trees, and ensemble methods, were trained and tested. Emotional arousal and valence levels were predicted with 85% and 91% accuracy, respectively.


I. INTRODUCTION
Among the various dimensional models of affect, the 2D arousal-valence emotion space of Russell [1] is the most commonly used one. Emotional valence describes the extent to which an emotion is positive or negative [2], whereas arousal refers to the level of calmness (i.e., low arousal) or excitation (i.e., high arousal) elicited by a stimulus [3].
Physiological signals combined with eye-related metrics are the most commonly used modality for estimating one's emotional state [4]. However, several studies have used eye features as the only predictor of emotional arousal and valence levels. These studies address either multi-class [5], [6], [7] or binary classification problems [8], [9]; the success rates of the multi-class cases remain below 80%, while the binary approaches have proven more effective, reaching up to 93% prediction success. Nevertheless, none of the aforementioned studies investigates the discrimination between arousal and valence levels in parallel.
A meta-analysis of the related studies has shown that the gaze extracted features that better indicate emotional arousal are pupil diameter and blink duration [3].
This work reports the results of a study in which participants watched emotion-evoking video clips while an eye tracker captured eye motion and activity. The features extracted from all acquired gaze signals were used to train and evaluate a set of classification algorithms, including decision trees, discriminant analysis, support vector machines (SVM), k-nearest neighbors (kNN) and ensemble learning algorithms, aiming to accurately classify the various arousal and valence levels.

II. PROTOCOL FOR DATA COLLECTION
In the present study, 37 participants (22 female, 15 male) with a mean age of 29 (SD: 7) years were enrolled. Binocular visual acuity at 80 cm was measured before each trial (mean VA: -0.10±0.07 logMAR). The mean illuminance at the cornea with the screen on was 450 (SD: 24) lux. Two video clips for each of the four emotions (happiness, sadness, anger and disgust) were obtained from the public database FilmStim [10]. Two more video clips served as neutral videos, for a total of 10 videos watched by each participant. The video clips were presented in randomised order. After each video clip, participants were presented with a questionnaire for the self-assessment of the above-mentioned emotions on a scale from 1 to 10. The design of the study is shown in Fig. 2. We only accepted self-assessments of 5 or higher as a true indication of the presence of a specific emotion; a self-assessment score lower than 5 was treated as emotionally neutral. In parallel, we estimated the level of arousal and valence for each of the emotions (Table I), based on [11].
The video clips were presented on a computer screen at 80 cm distance from the study participant, as shown in Fig. 1. All measurements were performed with the participants seated on a chair with their head stabilized by a chin and head rest to minimize head movements. Eye tracking measurements were recorded with the Pupil Labs "Pupil Core" eye and gaze tracker. From the 10 emotion-evoking videos watched by each participant, and after removing the invalid recordings in which participants looked away from the screen or closed their eyes to avoid a certain video scene, a total of 362 examples were collected. The valence and arousal classes into which the data were split can be seen in Table II. The "positive" valence and "low" arousal classes are minority classes, as they contain only a small number of examples. The study protocol was approved by the Ethics Committee of FORTH and all participants signed a written consent form.

III. DATA ANALYSIS
A. Feature extraction
Feature extraction is crucial for efficiently discriminating among the various emotional arousal and valence levels solely from the low-level eye and gaze metrics collected by the eye tracker. For each recording sequence, the raw gaze points from Pupil Core are processed and analyzed to ensure that the participants watched the whole video, did not look away from the screen, and did not close their eyes for longer than the average blink duration in order to avoid watching. If these prerequisites are satisfied, fixations and saccades are identified with the I-VT algorithm proposed by Salvucci and Goldberg [12], and fixation- and saccade-related features are calculated. In addition, the pupil diameter and blink timings computed by the eye tracker are used to extract pupil- and blink-related features. In total, 29 eye and gaze features are extracted; they are presented in Table III.
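The core of the I-VT algorithm can be sketched in a few lines: each inter-sample gaze velocity is compared against a single threshold, with fast intervals labeled as saccades and slow ones as fixations. The threshold value and the toy gaze trace below are illustrative, not the parameters used in the study.

```python
# Minimal sketch of I-VT (velocity-threshold) fixation/saccade labeling.
import math

def ivt_classify(points, timestamps, threshold_deg_per_s):
    """Label each inter-sample interval as 'fixation' or 'saccade'.

    points     -- list of (x, y) gaze positions in degrees of visual angle
    timestamps -- sample times in seconds, same length as points
    """
    labels = []
    for (x0, y0), (x1, y1), t0, t1 in zip(points, points[1:],
                                          timestamps, timestamps[1:]):
        velocity = math.hypot(x1 - x0, y1 - y0) / (t1 - t0)
        labels.append("saccade" if velocity > threshold_deg_per_s else "fixation")
    return labels

# Toy trace: a stable fixation followed by a fast jump to a new location.
pts = [(0.0, 0.0), (0.05, 0.0), (0.1, 0.02), (8.0, 5.0), (8.05, 5.0)]
ts = [0.00, 0.01, 0.02, 0.03, 0.04]
print(ivt_classify(pts, ts, threshold_deg_per_s=100.0))
# → ['fixation', 'fixation', 'saccade', 'fixation']
```

Consecutive samples with the same label are then merged into fixation and saccade events, from which durations, amplitudes and related statistics can be derived.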

B. Data processing
The problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary. One way to address this is to oversample the minority class. An improvement over simply duplicating minority-class examples is to synthesize new ones, a type of data augmentation for tabular data that can be very effective. We employed the widely used Synthetic Minority Oversampling TEchnique (SMOTE) [13].
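The idea behind SMOTE can be sketched as follows: a synthetic example is placed at a random point on the segment between a minority sample and one of its nearest minority-class neighbours. The toy data and the choice of k below are illustrative; the study used the standard SMOTE implementation.

```python
# Sketch of SMOTE-style oversampling by interpolation between minority samples.
import math
import random

def smote_sample(minority, k=2, rng=random):
    """Return one synthetic example from a list of minority-class vectors."""
    base = rng.choice(minority)
    # k nearest minority neighbours of `base` (excluding itself)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

minority_class = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0]]
synthetic = smote_sample(minority_class, k=2, rng=random.Random(0))
print(synthetic)  # a new point between two nearby minority samples
```

Because new points are interpolated rather than copied, the classifier sees a denser but still plausible minority region instead of exact duplicates.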
In addition, because many machine learning algorithms perform better when numerical input variables are scaled to a standard range [14], we scaled the data with the MinMax scaler, since our data are not normally distributed. The MinMax scaler rescales the dataset so that all feature values lie in the same range (0-1).
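The MinMax transform maps each feature linearly onto [0, 1] via (x - min) / (max - min); the feature column below is an illustrative example, not study data.

```python
# Min-max scaling of a single feature column onto [0, 1].
def minmax_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

pupil_diameter_mm = [2.5, 3.0, 4.0, 5.5]
print(minmax_scale(pupil_diameter_mm))  # → [0.0, 0.1666..., 0.5, 1.0]
```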

C. Feature selection
We tested two different feature selection approaches. In the first, we used the ANOVA test together with the feature importances obtained from ensemble methods. The second approach was LASSO regularization analysis, a regression method that performs both variable selection and regularization, thus improving accuracy and interpretability [15].
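LASSO performs selection because its L1 penalty drives the weights of uninformative features to exactly zero; the surviving features are kept. A minimal coordinate-descent sketch on toy data (one informative feature, one irrelevant one) illustrates this; the data and penalty value are assumptions for the example, not the study's settings.

```python
# Sketch of LASSO-based feature selection via coordinate descent with
# soft-thresholding, minimizing (1/2n)||y - Xw||^2 + lam * ||w||_1.
def soft_threshold(rho, lam):
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_select(X, y, lam, n_iters=100):
    """Return LASSO weights for the feature columns of X (assumed standardized)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            # partial residual correlation for coordinate j
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w

# Feature 0 drives the target; feature 1 is orthogonal noise.
X = [[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, -1.0]]
y = [-2.0, 0.0, 2.0, 0.0]
weights = lasso_select(X, y, lam=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(weights, selected)  # feature 1's weight is exactly zero
```

Note that the informative weight is slightly shrunk below its least-squares value (1.8 rather than 2.0 here), the usual price of the L1 penalty.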

D. Training and testing
For the classification procedure we split the data into training and testing sets, with the test data amounting to 20% of the total number of examples. We selected a range of well-known and widely used classifiers to study their performance. Furthermore, in order to choose an algorithm that can learn from the training data to recognize the classes of the target variable by minimizing the error function, we needed to tune each classifier's hyperparameters properly [16]. There are many hyperparameters and no general rule about their efficiency and suitability, so we had to find the combination that best fitted our data. Therefore, we performed a randomized search in which the machine iterated 1000 times through the training data to find the combination of parameters that maximizes the accuracy.
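The randomized search loop can be sketched as follows: sample random parameter combinations, score each one, and keep the best. A toy scoring function stands in for actual model training and cross-validation, and the parameter grid shown is illustrative, not the one used in the study.

```python
# Sketch of randomized hyperparameter search over a discrete parameter space.
import random

def random_search(param_space, score_fn, n_iter=1000, rng=random):
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

space = {
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [2, 4, 8, None],
}

# Toy score: pretend larger/deeper ensembles validate better up to a plateau.
def toy_score(p):
    depth = p["max_depth"] or 8
    return min(p["n_estimators"], 100) / 100 + min(depth, 8) / 8

best, score = random_search(space, toy_score, n_iter=1000, rng=random.Random(0))
print(best, score)
```

Unlike an exhaustive grid search, the cost is fixed by `n_iter` rather than by the size of the grid, which is why random search scales to large hyperparameter spaces.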

E. Model evaluation
We evaluated the models using the metrics of accuracy, precision, recall and f1-score. In the multi-class setting, precision, recall and f1-score are calculated on a per-class basis. Moreover, the models were validated using k-fold cross-validation (k = 10) to check how well they can be trained and predict unseen data. Finally, for a more comprehensive presentation of our results we calculated and plotted confusion matrices and ROC curves for each fold, illustrating how the ability of the classifier changes as its discrimination threshold is varied. For the multi-class classification problems, we calculated the ROC AUC for all classes using the One-vs-One (OvO) strategy, a heuristic for applying binary classification algorithms to multi-class problems [17].
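The per-class metrics reduce to simple counts of true positives, false positives and false negatives for each class in turn; the labels and predictions below are a toy example, not study data.

```python
# Per-class precision, recall and f1-score from true and predicted labels.
def per_class_metrics(y_true, y_pred):
    metrics = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

y_true = ["low", "medium", "high", "high", "medium", "low"]
y_pred = ["low", "medium", "high", "medium", "medium", "high"]
print(per_class_metrics(y_true, y_pred))
```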

IV. RESULTS
A. Dataset
Annotation of the data was based on the levels of valence and arousal acquired from the self-assessment questionnaire filled in by each participant. Overall, we performed four classification attempts, summarized in Table II. The first two investigate the presence of high arousal and of positive valence, respectively (binary approach), while the other two additionally attempt to discriminate the emotional states among their respective levels (multi-class approach), as seen in Table II. The 290 training examples and their numbers per class before oversampling are also presented in Table II. For all classification trials the training algorithms were run in a Python 3.6 environment and the respective training success rates were extracted. The models with the highest accuracy were stored and later used for predictions. The test data comprised a total of 72 examples, for which the prediction rates of the arousal and valence states, as well as recall, precision and f1-score, were estimated.

B. Results
Figs. 3-6 list the name of each classification process, the classifier that performed best in terms of accuracy, the feature selection method, as well as the precision, recall, f1-score and accuracy of the chosen model.

As can be observed in Fig. 3, the Ensemble Gradient Boosting classifier proved superior to the other classification models tested, predicting the presence of high emotional arousal with an 85% success rate. The features for this procedure were selected using LASSO analysis. The model's predictions of "high" emotional arousal were correct in 95% of cases, while the respective precision for the "not high" examples was 77%. The recall percentages for "high" and "not high" examples were 75% and 96%, respectively, and the f1-score was approximately 85% for both classes.

The results of our attempt to discriminate between positive and not positive emotional valence are presented in Fig. 4. Nine out of ten instances were classified correctly by the Random Forest classifier, while ANOVA analysis was used to select the dominant features. In this binary classification problem, the recall for "positive" instances was 95% and the respective precision was 88%, while the precision for the "not positive" cases was 95%. The f1-score remained over 90% for both classes.

In the next classification procedure, we attempted to discriminate between the three levels of emotional arousal: low, medium and high. The results are shown in Fig. 5. In this multi-class problem we achieved a classification accuracy of 82%. The best classification algorithm in terms of accuracy was Extra Trees, and the dominant features were selected by LASSO analysis. The positive predictive value was 85% for the "medium" arousal level and 83% for the "low" level.
In parallel, Extra Trees achieved satisfying sensitivity and f1-score rates for the "low" level, at 97% and 90% respectively. The final classification problem focused on distinguishing between the three levels of emotional valence, i.e., negative, neutral and positive. In this second multi-class attempt the Extra Trees classifier was once again the most efficient in terms of prediction success rate, reaching 77% correct predictions. In detail, the "positive" instances of emotional valence reached precision, recall and f1-score percentages of 88%, while the respective scores for the other two valence levels remained slightly lower.
Overall, the binary classification of emotional valence into "not positive" or "positive" using the Random Forest model achieved the best prediction rate, 91%. However, when the "neutral" class was added to create a multi-class problem, this percentage was reduced by 14 percentage points. In addition, the precision, sensitivity and f1-score for the "neutral" valence class were relatively lower than for the other two classes. Finally, regarding emotional arousal level recognition, the success rates of the binary and multi-class problems differed by only 3%, while the identification of the "high" arousal level, which indicates significant emotional charge, was correct in 85% of cases.

V. DISCUSSION
In the present manuscript, we report our work on classifying emotional arousal and valence into their relevant levels using eye and gaze tracking features. To this end, an experimental trial was performed to collect eye and gaze tracking data from subjects watching emotion-evoking video clips and self-assessing their emotions.
For each study participant, several eye and gaze related features were extracted, feature selection and data processing techniques were applied, and machine learning models were trained. A number of classifiers were tested and the best performing ones were identified.
From the results presented in Section IV-B, the highest success rate was observed in the binary classification between not positive and positive emotional valence, with the Random Forest algorithm outperforming all others at 91% accuracy, as it effectively reduces the risk of overfitting, balances the error for unbalanced data, and determines the importance of features quickly. However, the inclusion of the "neutral" class proved challenging, leading to a significant decrease in our system's performance. Regarding the emotional arousal level estimation, both the binary and the multi-class identification tasks provided promising results, reaching up to 85% correct predictions.
In this work, we have verified that using only eye and gaze metrics to estimate emotional arousal and valence produces results comparable to [8], [9]. Furthermore, when comparing our results to those of [5], [6], [7], it must be pointed out that they encourage the development of an emotion identification system with high discrimination ability.

VI. FUTURE WORK
The results presented in this article demonstrate the potential of utilizing machine learning optimization for discriminating between the various emotional states, while reinforcing the need for future research. In this direction, we aim to extend the dataset by adding more participants to the study. Furthermore, we are currently investigating new models as well as the applicability of deep learning methods, in order to create a model that synchronously estimates emotional arousal and valence levels with high efficiency. Finally, we plan to compare our findings with research works that utilize multimodal approaches, i.e., employ additional biosignals, and to investigate the necessity and potential of combining eye and gaze data with other biometrics for increased performance with respect to the computational cost.
ACKNOWLEDGMENT
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 826429 (Project: SeeFar). This paper reflects only the author's view and the Commission is not responsible for any use that may be made of the information it contains.