Parkinson’s Disease Detection based on Changes of Emotions during Speech

Parkinson’s disease (PD) is a neurodegenerative disease that affects 2-3 % of the population over 65 years of age in the EU. PD treatment is significantly more effective when it is administered early. Unfortunately, the disease is difficult to detect at an early stage, and by the time the symptoms can be recognized it is usually quite late. This strongly motivates the development of more accessible and accurate solutions for PD detection. One of the early symptoms is so-called hypomimia. This paper introduces an automatic method that can objectively detect PD. The method is based on the analysis of emotion changes during the pronunciation of defined speech exercises. As the exercise, we propose a Czech tongue twister, i.e. a sentence that is difficult to pronounce. We achieved a balanced accuracy of 69 % using the XGBoost algorithm. The features can be explained, and thus the method can be used in clinical practice. We found that the most valuable emotion for PD detection in this case is fear.


I. INTRODUCTION
Parkinson's disease (PD) is a neurodegenerative disease that affects 2-3 % of the population over 65 years of age [1]. Population aging is expected to be one of the main problems that Europe, among others, will face within the next 30 years. At the same time, the number of people suffering from neurodegenerative diseases will increase [2]. When Parkinson's treatment is administered at an early stage, the impairment of health is significantly smaller [3]. This is why early detection of the disease is in high demand. For this purpose, plenty of new approaches have emerged in recent years. Some of them are based on new technologies that make the detection of Parkinson's disease significantly easier and thus allow it to be detected at an early stage [4].
Unfortunately, this is not an easy task: for many years the disease has only barely visible markers, and cognitive skills in a healthy population also vary significantly with the level of education, age, etc. The most accurate methods are magnetic resonance imaging (MRI), positron emission tomography (PET), and computed tomography (CT). Unfortunately, they are relatively expensive and are therefore rarely used for preventive screening, but rather at an advanced stage of the disease [5], [6]. It is therefore desirable to create and use cheaper solutions.
Technology can make visible even those markers that are insignificant at first glance, both when detecting the disease and when measuring its progress. It can consequently significantly improve the quality of life of Parkinsonians and their families and create communication channels with their physicians [7]. Telemedicine solutions may also reduce the number of necessary visits to doctors and thus decrease costs for the health system [7], [8].
Parkinson's disease has many symptoms, which can be divided into two groups: motor and non-motor. The motor symptoms include, among others, freezing of gait, bradykinesia, tremor, dyskinesia, and dysphagia. The non-motor symptoms include, for example, depression, anxiety, sleep disorders, urinary symptoms, dysarthria, and hypomimia [9], [10], [1].
The aforementioned hypomimia manifests as a reduction and slowness of facial movement (facial bradykinesia). The faces of Parkinsonians resemble the so-called 'poker face' [11]. Additionally, asymmetry in the movement of the facial muscles and stiffness of the muscles are observed. These symptoms cause, for instance, difficulties with expressing emotions [9]. Interestingly, PD patients are also worse at recognizing the emotions of other people than healthy controls (HC) [12].
Communication allows people to exchange information, ideas, and feelings or emotions [13]. In Parkinson's disease this process is disrupted, because the disease also negatively affects the vocal tract (dysarthria) [10] as well as cognitive skills [14]. Moreover, difficulties in communication also affect the social well-being of PD patients [14]. One speech exercise regarded as challenging to pronounce is the tongue twister, because it requires correct use of the mouth and tongue. It can be assumed that dysarthria manifests especially when PD patients try to pronounce a tongue twister, due to the deterioration of the articulators [10].
In this paper, we examine the possibility of detecting Parkinson's disease based on changes in emotion during a speech exercise. The work rests on the assumption that PD patients have a disturbed expression of emotions, exhibit stiffness of the facial muscles, and have problems with correct articulation. The proposed machine learning approach achieved a balanced accuracy of 0.69 and showed which features were valuable for the obtained prediction. Moreover, we performed a statistical analysis and found the features that differed between the PD and HC groups. Additionally, we compared two speech exercises performed by the participants of the experiment, a tongue twister and reading a long text aloud, to determine which is more valuable.
The main contribution of this paper is that we show that tongue twisters can be used for PD detection and that the detection can be based on facial features and emotion recognition. The identified features can be interpreted and explained, so they can also be used in clinical practice.
The rest of the paper is structured as follows. Section II describes related work. Section III introduces the experiment, i.e. how the experiment was performed (subsection A. Experiment description), describes the data (subsection B. Data description), how the features were extracted (subsection C. Feature extraction), and which metrics were chosen to evaluate the created models (subsection D. Metrics). Section IV provides results and discussion (subsections: A. Results, B. Discussion). The last section concludes the paper.

II. RELATED WORKS
One of the most accurate approaches to PD detection is imaging of the brain using e.g. MRI, CT, or PET [1]. The disadvantage of these methods is that they are expensive and therefore not very suitable for preventive screening. For this reason, we focus on cheaper approaches. Some symptoms, such as sleep disorders, are challenging to monitor because the gold-standard method, polysomnography, is uncomfortable for patients [15].
The novel methods that have emerged in recent years include audio analysis [16], video analysis [17], and wearable sensor analysis [18]. There are also some works that focus on multimodal analysis [19]. Most of the works related to video analysis focus on emotion analysis (hypomimia) [20] and gait analysis [17].
There are currently only a few works that focus on the detection of PD using hypomimia. Due to the common lack of data in this domain, the approaches usually use only statistical analysis [21], [22]. A few approaches also use automatic emotion detection [23], [24], [20]. These studies analyze the differences in the expression of emotions between PD and HC. The emotions most frequently taken into consideration are anger, fear, happiness, sadness, disgust, surprise, and neutral [25], [26].
Several approaches to the analysis of facial expression have been used so far to better understand hypomimia in PD patients. Examples are the analysis of electromyography (EMG) records, affectograms, Action Units, the Maximally Discriminative Facial Movement Coding System, and methods based on machine learning emotion recognition [25].
Facial features and action units (AU) have been used for PD analysis so far. Such a system was introduced, for example, in [26], where it was expressed as a function of the frequency of AU. The results indicated a statistically significant difference between PD and HC. Additionally, EMG measurements of muscle contraction also showed a statistical difference between the two studied groups. Another work focused on AU is [24]. Automatic detection of PD was achieved using a 3D sensor, AU, and linear regression. The quality of prediction was up to 0.99 Area Under Curve (AUC). Unfortunately, the dataset used is relatively limited.
In 2018, a solution was introduced that used Mel-frequency cepstral coefficients (MFCC) extracted from audio and AU extracted from video to regress the level of facial expression [20]. A hierarchical Bayesian network was trained on a dataset containing records of Parkinsonians. The obtained result for multiclass classification was an F1-score of 0.55 [20]. Unfortunately, the results are not reproducible since the dataset is private.
However, PD detection approaches based on emotions are subjective. The imitation of emotions at a specific point in time is biased to some degree. Additionally, labeling is not optimal either: it is mostly done by physicians, which is also a subjective process.
A few neural network architectures have been proposed so far for facial emotion recognition (FER), and some of them were trained on the FER2013 dataset. The dataset contains 7 types of emotions: anger, disgust, sadness, surprise, fear, happiness, and neutral. It comprises 35 685 examples in total [27]. One of the proposed architectures was a simple convolutional neural network in which the traditional softmax was substituted by a linear support vector machine [28]. The achieved accuracy was 71.2 %. Human accuracy on the FER2013 dataset was 65 ± 5 %. Another architecture was based on a deep neural network with two convolutional layers, max-pooling, and four Inception layers. The achieved accuracy was 66.4 % [29]. Next, in [30], VGG16 was used together with a soft label constructor to take into consideration possible augmentation of the emotions, reaching 73.73 % accuracy.
The conclusion from recent work is that FER can outperform human abilities in emotion recognition. Thanks to this, neural network architectures have the potential to improve PD detection as well, for example in clinical practice. One of the biggest limitations of the current approaches is still the rather limited amount of data for training machine learning algorithms. In addition, the results of general emotion detection can vary significantly depending on the particular activity. In this paper, we propose a technique based on the pronunciation of unified speech exercises and validate whether they can improve the quality of PD detection.

III. EXPERIMENT
A. Experiment description
The main goal of this experiment was to evaluate the accuracy of Parkinson's disease detection based on changes in the expression of emotion during speech, in particular on changes in facial mimicry during exercises, i.e. while PD and HC participants read a selected text.
The scheme of the carried-out experiment is shown in fig. 1. First, the process of feature extraction was done. It is described in more detail in section III-C.
Subsequently, the features were pre-selected using the minimum redundancy maximum relevance (mRMR) algorithm, and the 50 most relevant features were chosen. Afterward, machine learning classifiers were evaluated with stratified 10-fold cross-validation. Six classifiers were trained and their results compared: Support Vector Machine (SVM), k-nearest neighbors (kNN), Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and XGBoost.
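The stratified splitting used in this protocol can be illustrated with a minimal sketch in plain Python. The function name `stratified_kfold`, the round-robin assignment, and the fixed seed are illustrative assumptions and not the implementation used in the experiment; in practice a library routine such as scikit-learn's StratifiedKFold would typically serve the same purpose.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


# Cohort sizes from the data description: 70 PD (label 1) and 45 HC (label 0).
labels = [1] * 70 + [0] * 45
splits = list(stratified_kfold(labels))
```

With these cohort sizes, every test fold contains 7 PD records and 4 or 5 HC records, so the class ratio in each fold stays close to that of the whole dataset.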
Differences between the groups were evaluated using Mann-Whitney U tests, which assessed whether the distributions of the two populations (HC vs. PD) differ for each feature. The false discovery rate (FDR) correction was also applied to limit the number of type I errors.
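The statistical step can be sketched as follows. The text does not state which FDR procedure or which variant of the test was used, so the Benjamini-Hochberg procedure and the normal approximation with tie correction are assumptions here; the function names and the toy values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Tie correction for the variance of U.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u1, p


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy values of one feature for two groups (illustrative numbers only).
hc = [0.11, 0.15, 0.12, 0.10, 0.14]
pd_group = [0.21, 0.19, 0.25, 0.16, 0.23]
u_stat, p_value = mann_whitney_u(hc, pd_group)
adjusted = benjamini_hochberg([p_value, 0.20, 0.03])
```

In the experiment one such test would be run per feature, and the correction would be applied to the whole vector of resulting p-values.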

B. Data description
We enrolled 45 HC (21 females (mean age 62 ± 9.22), 24 males (mean age 66 ± 9.17)) and 70 PD patients (27 females (mean age 68 ± 8.04), 43 males (mean age 66 ± 7.83)). For female patients, the mean UPDRS III was 21.6 ± 13.50 and the mean duration of PD was 7.2 ± 4.82 years, whereas for male patients the mean UPDRS III was 26.3 ± 11.31 and the mean duration of PD was 7.9 ± 4.64 years. Both groups were recorded while pronouncing, among others, vowels and sentences and while reading long texts. The videos were recorded at 25 frames per second (FPS).
For this experiment, a Czech sentence (Celý večer se učí sčítat) and the reading aloud of a long text were selected. The meaning of the sentence is "He's been learning to count all night", but what matters more than its meaning is that the sentence is hard to pronounce. The data were collected by neurologists. The experiment was approved by the Ethical Committee of Masaryk University.
Thus we obtained 70 + 45 = 115 video records in total for the speech exercise. The length of the records varies.

C. Feature extraction
The process of obtaining features consists of several steps. The general idea was to extract features that describe changes of emotion over a certain period and thereby make it possible to analyze emotions numerically during a speech exercise. First, a publicly available model for facial emotion recognition (FER) [31], trained on the publicly available FER2013 dataset [32], was used. The FER architecture works in two steps to provide the most accurate prediction: Multitask Cascaded Convolutional Networks (MTCNN) for face detection [33], followed by an architecture that evaluates the intensity of the emotions anger, fear, happiness, sadness, disgust, neutral, and surprise [31]. The features were extracted from the video records in the form of seven time series, which represent the dynamics of the facial expressions. The regression values were calculated separately for each frame extracted from the video. An example is shown in fig. 3. Moreover, the results of using FER on a single frame of a PD patient can be seen in fig. 2.
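The assembly of the seven time series from per-frame predictions can be sketched as below. The per-frame scores are assumed to be already available as dictionaries produced by some FER model; the emotion keys, the function name `frames_to_series`, and the choice to skip frames without a detected face are assumptions for illustration, not details taken from [31].

```python
# Emotion labels; the short names mirror the feature names used in the results
# (e.g. "angry max", "sad approx entropy") and are otherwise an assumption.
EMOTIONS = ("angry", "disgust", "fear", "happy", "sad", "surprise", "neutral")


def frames_to_series(frame_scores):
    """Turn per-frame emotion scores into one time series per emotion.

    frame_scores holds one dict per video frame that maps an emotion name to
    its score, or None when no face was detected in that frame.
    """
    series = {emotion: [] for emotion in EMOTIONS}
    for scores in frame_scores:
        if scores is None:
            continue  # assumption: frames without a detected face are skipped
        for emotion in EMOTIONS:
            series[emotion].append(float(scores[emotion]))
    return series


# Three hypothetical frames; the middle one has no detected face.
frames = [
    {"angry": 0.1, "disgust": 0.0, "fear": 0.2, "happy": 0.1,
     "sad": 0.1, "surprise": 0.1, "neutral": 0.4},
    None,
    {"angry": 0.0, "disgust": 0.0, "fear": 0.6, "happy": 0.0,
     "sad": 0.1, "surprise": 0.2, "neutral": 0.1},
]
series = frames_to_series(frames)
```

Skipping frames makes the time axis non-uniform, so an analysis that depends on exact timing at 25 FPS might instead interpolate the missing values.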
In the next step, the changes of emotions during the given time interval were represented numerically, and the statistical differences between the two examined groups, i.e. PD versus HC, were calculated. The numerical descriptors are the minimum, maximum, amplitude of the course, standard deviation (STD), relative standard deviation (RSTD), variance, slope of the time series, and features based on the distribution of the data, i.e. kurtosis and skewness. Furthermore, indicators describing the degree of disorder in the information were taken into consideration: approximate entropy, which is designed for medical time series, and Shannon entropy. This set of features was calculated for the time series of every emotion.
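A minimal sketch of how such descriptors could be computed for one emotion time series is given below. The parameter choices are assumptions, since the text does not state them: population (not sample) moments, excess kurtosis, a 10-bin histogram for Shannon entropy, and approximate entropy with m = 2 and r = 0.2 times the standard deviation.

```python
import math
import statistics


def slope(series):
    """Least-squares slope of the series against the frame index."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_x = sum(series) / n
    num = sum((t - mean_t) * (x - mean_x) for t, x in enumerate(series))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def shannon_entropy(series, bins=10):
    """Shannon entropy (in bits) of the histogram of the series values."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for x in series:
        counts[min(int((x - lo) / (hi - lo) * bins), bins - 1)] += 1
    probs = [c / len(series) for c in counts if c]
    return -sum(p * math.log2(p) for p in probs)


def approximate_entropy(series, m=2, r=None):
    """Approximate entropy ApEn(m, r) using the Chebyshev distance."""
    n = len(series)
    if r is None:
        r = 0.2 * statistics.pstdev(series)  # common default, assumed here

    def phi(length):
        templates = [series[i:i + length] for i in range(n - length + 1)]
        total = 0.0
        for a in templates:
            matches = sum(
                1 for b in templates
                if max(abs(p - q) for p, q in zip(a, b)) <= r
            )
            total += math.log(matches / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)


def describe(series):
    """Summary features of one emotion time series (needs at least 4 values)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c ** 2 for c in centered) / n  # population variance
    std = math.sqrt(variance)
    m3 = sum(c ** 3 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    return {
        "min": min(series),
        "max": max(series),
        "range": max(series) - min(series),
        "std": std,
        "rstd": std / mean if mean else 0.0,
        "variance": variance,
        "slope": slope(series),
        "skewness": m3 / std ** 3 if std else 0.0,
        "kurtosis": m4 / variance ** 2 - 3 if std else 0.0,  # excess kurtosis
        "shannon_entropy": shannon_entropy(series),
        "approx_entropy": approximate_entropy(series),
    }


# A short synthetic ramp stands in for a real emotion time series.
features = describe([0.1, 0.2, 0.3, 0.4, 0.5])
```

Applying `describe` to each of the seven time series of a record and prefixing the keys with the emotion name would yield feature names of the kind reported in the results, such as the standard deviation of fear.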

D. Metrics
For evaluation purposes, 4 metrics were calculated. Due to the small imbalance of the dataset, balanced accuracy and the Matthews correlation coefficient (MCC) were computed. Additionally, the clinically valuable metrics of sensitivity and specificity were also taken into consideration. All four are defined in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
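The four metrics follow their standard definitions, which the sketch below computes from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the experiment.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the four evaluation metrics from confusion-matrix counts."""
    # Sensitivity = TP / (TP + FN): share of PD patients detected.
    sensitivity = tp / (tp + fn)
    # Specificity = TN / (TN + FP): share of HC correctly rejected.
    specificity = tn / (tn + fp)
    # Balanced accuracy = (sensitivity + specificity) / 2.
    balanced_accuracy = (sensitivity + specificity) / 2
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Toy counts for illustration only (10 positives, 10 negatives).
metrics = classification_metrics(tp=8, tn=6, fp=4, fn=2)
```

Balanced accuracy and MCC are less sensitive to class imbalance than plain accuracy, which is why they suit a dataset with 70 PD and 45 HC records.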

E. XGBoost as classifier
Several classifiers were tested in this work; however, the XGBoost classifier deserves special attention. The algorithm treats boosting as function approximation that optimizes a specific loss function and applies regularization techniques. Additionally, XGBoost computes faster thanks to parallelization and distributed techniques [34]. Equations 4, 5, 6, 7, and 8 explain mathematically one of the biggest advantages of XGBoost, its regularization technique.
The algorithm minimizes an objective function that consists of the loss function l and the regularization part Ω (see Eq. 4).
However, the obstacle in minimizing Eq. 4 is that commonly used optimization techniques must be performed in Euclidean space and are not suitable in this case. For this reason, the second-order Taylor approximation is used to obtain a new function that can be optimized in Euclidean space.
The transformation expands the loss function l, i.e. the function to be approximated, around the prediction from the previous iteration step (t-1). After this transformation, the function takes the form of Eq. 5, where g_i and h_i are the first- and second-order gradients, respectively. In the next step, the constant part is removed (see Eq. 6), and optimization techniques dedicated to quadratic functions can then be used.
The main point is to find the weights w of the nodes; after the transformations, they take the form of Eq. 7, which is equal to Eq. 8.
Here, G_jm is the sum of the gradients in region (leaf) j and H_jm is the sum of the Hessians in region j for the set of leaves m.
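For reference, the derivation described in the text can be written out as follows. This is a sketch that follows the standard XGBoost formulation in [34] and not necessarily the exact notation of Eqs. 4-8; the assignment of equation numbers is inferred from the description above. The symbols n (number of samples), f_t (the tree added in step t), I_jm (the set of samples falling into leaf j), T (number of leaves), and the regularization parameters γ and λ are assumptions taken from that formulation, and only the L2 penalty is shown.

```latex
\begin{align}
\mathcal{L}^{(t)} &= \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{4} \\
\mathcal{L}^{(t)} &\approx \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \tag{5} \\
\tilde{\mathcal{L}}^{(t)} &= \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \tag{6} \\
w_{jm} &= -\frac{\sum_{i \in I_{jm}} g_i}{\sum_{i \in I_{jm}} h_i + \lambda} \tag{7} \\
w_{jm} &= -\frac{G_{jm}}{H_{jm} + \lambda} \tag{8}
\end{align}
% with
% g_i = \partial l(y_i, \hat{y}_i^{(t-1)}) / \partial \hat{y}_i^{(t-1)},
% h_i = \partial^2 l(y_i, \hat{y}_i^{(t-1)}) / \partial (\hat{y}_i^{(t-1)})^2,
% \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 .
```

The first term inside the sum of Eq. 5 does not depend on f_t, which is why it can be dropped as a constant to obtain Eq. 6.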
To summarize the optimization technique of XGBoost highlighted above, this classifier takes into consideration L1 and L2 regularization as well as penalization of the number of leaf nodes [34].
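The effect of the regularization terms on a single leaf can be illustrated with a short sketch. It assumes the logistic loss, for which g_i = p_i - y_i and h_i = p_i (1 - p_i), and it treats the L1 penalty by soft-thresholding the gradient sum, which is how the L1 term is commonly handled; the function name `leaf_weight` and the parameter names are hypothetical.

```python
def leaf_weight(gradients, hessians, reg_lambda=1.0, reg_alpha=0.0):
    """Regularized optimal weight of one leaf: w = -G / (H + lambda).

    reg_alpha applies an L1 penalty by shrinking |G| toward zero first.
    """
    G = sum(gradients)
    H = sum(hessians)
    if G > reg_alpha:
        G_shrunk = G - reg_alpha
    elif G < -reg_alpha:
        G_shrunk = G + reg_alpha
    else:
        G_shrunk = 0.0  # the L1 penalty removes a weak leaf entirely
    return -G_shrunk / (H + reg_lambda)


# Four samples in one leaf, previous prediction p = 0.5 for each,
# true labels 1, 1, 1, 0 (illustrative values only).
p = [0.5, 0.5, 0.5, 0.5]
y = [1, 1, 1, 0]
g = [pi - yi for pi, yi in zip(p, y)]   # first-order gradients
h = [pi * (1 - pi) for pi in p]         # second-order gradients

w_l2 = leaf_weight(g, h, reg_lambda=1.0)
w_l1_l2 = leaf_weight(g, h, reg_lambda=1.0, reg_alpha=0.4)
```

A larger λ shrinks every leaf weight toward zero, whereas the L1 term can set the weight of a leaf with a small gradient sum exactly to zero.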

IV. RESULTS AND DISCUSSION
A. Results
The results of the statistical analysis for the tongue twister are shown in tab. I. Among the 10 most statistically different features between PD and HC were the standard deviation (fear std), variance (fear variance), maximum (fear max), and range (fear range) of fear; the standard deviation, variance (angry variance), and minimum (angry min) of anger; and the approximate entropy of sadness. For all of them, the p-value, both without and with FDR correction, was below 0.05. Additionally, the median and interquartile range were calculated for all of them. According to the statistical analysis, the most relevant characteristic that allows us to distinguish a Parkinsonian from a healthy person is the manifestation of fear on the face.
The prediction results of the 6 classifiers are shown in tab. II. The XGBoost classifier was identified as the most accurate solution for detecting Parkinson's disease. Its balanced accuracy reached 0.69, and its MCC of 0.39 was also the highest among the classifiers. The sensitivity and specificity were 0.71 and 0.67, respectively. The highest sensitivity (0.73) was achieved by LR and the highest specificity (0.73) by DT.
The model was interpreted using SHAP values, as presented in Fig. 4. The most important features, in descending order, are the standard deviation of fear (fear std), approximate entropy of surprise (surprise approx entropy), variance of fear (fear variance), maximum of anger (angry max), range of fear (fear range), variance of anger (angry variance), maximum of fear (fear max), mean of surprise (surprise mean), mean of fear (fear mean), and approximate entropy of sadness (sad approx entropy). The fear std, fear variance, angry max, angry variance, fear max, and fear mean are positively correlated with the prediction of Parkinson's disease, whereas the surprise approx entropy and sad approx entropy are negatively correlated with it.
The predictions for the individual speech exercises were also compared and are shown in tab. III. Reading long texts was evaluated with the same metrics. The tongue twister exercise achieved a better balanced accuracy, 0.69 versus 0.60 for reading a long text.

B. Discussion
An automatic method for detecting Parkinson's disease by analyzing changes of emotion during a Czech tongue twister was created. We created features that proved to be statistically different between the groups for this task. The expression of fear was identified as the most valuable feature. In this paper, this emotion was expressed mainly as a result of trying to pronounce a difficult sentence. We found that such speech exercises are even better for detecting PD than reading a long text. XGBoost was identified as the best machine learning algorithm, even on an imbalanced dataset. The changes in fear were captured best by features such as the standard deviation, variance, range, and mean. This could mean that fear is recognized during this speech exercise because the difficulties that PD patients encounter make them somewhat anxious in comparison to HC. Moreover, the approximate entropy of sadness and surprise is negatively correlated with Parkinson's disease, which could indicate smaller changes in the expression of these emotions in PD. Additionally, the XGBoost model achieved a balanced accuracy of 0.69, which is a promising result and shows that changes in the emotions of PD patients during speech exercises are valuable from a scientific point of view.

V. CONCLUSION
In this study, we presented a novel approach to the detection of Parkinson's disease based on the expression of facial features. First, we developed features that can differentiate the healthy control group from Parkinsonians. The features describing differences in the expression of fear over time were the most significant from the statistical point of view. The XGBoost classifier outperformed the other classifiers and achieved a balanced accuracy of 0.69. The chosen algorithm allows us to apply SHAP values and thereby interpret the created model. Once again, the features based on fear indicate a positive correlation with Parkinson's disease. Furthermore, it seems that the selection of the speech exercise plays a crucial role in the final performance. Future research should enlarge the database and test other tongue twisters or develop other speech exercises valuable for this research.