END-2-END MODELING OF SPEECH AND GAIT FROM PATIENTS WITH PARKINSON’S DISEASE: COMPARISON BETWEEN HIGH QUALITY VS. SMARTPHONE DATA

Parkinson’s disease is a neurodegenerative disorder characterized by the presence of different motor impairments. Speech and gait signals have been analyzed to detect the presence of the disease and its severity in patients. However, most studies have been performed in controlled conditions using high quality data, which makes them unsuitable for a continuous at-home evaluation of the state of the patients. The developed technology should be evaluated in more realistic scenarios, for instance using smartphone data. We propose the use of state-of-the-art deep learning techniques to evaluate the speech and gait symptoms of patients. The proposed methods are evaluated in two scenarios to cover both high quality and smartphone data. The results indicate that it is possible to classify patients and healthy subjects with accuracies over 92% in both scenarios. The proposed methods are also promising for evaluating the severity of the speech symptoms and the global motor state of the patients.


INTRODUCTION
Parkinson's disease (PD) is a neurodegenerative disorder that produces different motor symptoms in the patients, including tremor, rigidity, and bradykinesia, among others [1]. Between 70% and 90% of patients develop a speech impairment called hypokinetic dysarthria [2], which manifests in the imprecise articulation of consonants, monoloudness, and monopitch, among other symptoms. The traditional assessment of the disease depends on the experience of the clinician performing the screening, which makes it difficult to diagnose the disease and to determine its degree of severity. It is important to identify the earliest symptoms of PD in order to treat the disease in the prodromal phase, and to evaluate how severe the symptoms of a patient are in order to prescribe a better treatment.
Several studies have modeled the speech of PD patients in terms of phonation, articulation, prosody, and intelligibility [3,4]. These traditional methods are based on the computation of hand-crafted features such as jitter, shimmer, or formant frequencies, which may not completely model all the phenomena that appear due to the presence of the disease and the dysarthria level of the patients. Recent studies have proposed the use of deep learning methods to model the speech of PD patients [5,6,7]. Most of them consider convolutional neural networks (CNN) to process time-frequency representations of the speech signals, such as Mel-spectrograms. The authors usually focus only on the classification of PD vs. healthy control (HC) subjects, leaving aside the evaluation of the disease severity. The accuracy reported in those studies ranges from 80% to 90% for the classification of PD vs. HC subjects.
The research community has also shown a growing interest in the automatic gait analysis of PD. The assessment is commonly performed with inertial sensors attached to the body of the patients [8,9] and with force-sensitive sensors placed inside the shoes of the participants [10]. By using inertial sensors, it is possible to detect and characterize specific movements and to monitor activities of daily living of PD patients [11]. Most of the studies have considered kinematic features based on the duration and velocity of the steps [10,12,13]. Other studies have considered spectral features to evaluate the harmonic structure of the gait process [14,15], or non-linear dynamics methods to model long-range autocorrelations and stability patterns of the walking process [16,17,18]. Few studies have considered deep learning models that evaluate the gait of PD patients from the raw gait signals, so that the neural network automatically learns the most appropriate features [19,20].
Most studies that model speech and gait symptoms of PD patients have been performed in controlled conditions, using high quality speech data and external inertial sensors attached to the body of the participants [21]. These aspects make many of the proposed approaches suitable for clinical practice but not for an at-home evaluation of the state of the patients. The technology to monitor the state of PD patients should be evaluated in more realistic scenarios, for instance using smartphones. A more reliable assessment of the patients at home can be performed using the microphone and the inertial sensors available in smartphones, which can be used to evaluate different motor impairments in the speech production and in the upper and lower limbs.
We propose the use of state-of-the-art deep learning techniques to evaluate the speech and gait symptoms of PD patients. The proposed methods are evaluated in two scenarios to cover both high quality data, which is normally captured in a clinical evaluation, and smartphone data, which can be used to monitor the state of the patients at home. In both scenarios, the methods are used to classify PD vs. HC subjects, and to evaluate the severity of the motor symptoms. The results indicate that it is possible to classify PD patients and HC subjects with accuracies over 92% using both high quality and smartphone speech data, and with accuracies over 94% using gait data, in both scenarios. We believe that within the next decade, monitoring of motor symptoms of PD patients will gradually shift from the clinic to the home, where continuous monitoring can be performed. The next step will be the application of the proposed methods in the longitudinal and individual evaluation of the symptoms of the patients, in order to monitor the progression of the disease per patient.

Multimodal corpus
The data include high quality speech and gait signals from 106 PD patients and 105 HC subjects, all of them native speakers of Colombian Spanish. These data are age- and gender-balanced. 94 of the patients were labeled according to the third section of the Movement Disorder Society - Unified Parkinson's Disease Rating Scale (MDS-UPDRS-III). Additionally, the speech recordings from 93 of the PD patients and from 48 of the HC subjects were labeled according to the modified Frenchay dysarthria assessment (m-FDA) scale, which evaluates the dysarthria severity of the participants [22].
The speech protocol includes the utterance of six diadochokinetic (DDK) exercises, the reading of 10 sentences, a phonetically balanced read text of 36 words, and a monologue where the participants were asked to speak about their daily routine. The speech signals were recorded with a sampling frequency of 16 kHz and 16-bit resolution. The gait signals were captured with the eGaIT system, which consists of a 3D accelerometer (range ±6 g) and a 3D gyroscope (range ±500 °/s) attached to the external side (at the ankle level) of the shoes [9]. Data from both feet were captured with a sampling frequency of 100 Hz and 12-bit resolution. The exercises included 20 meters walking with a stop after 10 meters (2x10), 40 meters walking with a stop every 10 meters (4x10), 20 meters walking with stops every three meters (Stop & Go), heel-toe tapping, and the timed up and go (TUG) test.

Apkinson corpus
This corpus was collected using the Apkinson Android application [23], which was designed to record several signals using the microphone and accelerometer available on smartphones. The data contain speech and movement signals collected from 38 PD patients and 60 HC subjects. 26 of the patients were labeled with the MDS-UPDRS-III. None of the participants in the HC group presented any neurological or movement disorder. The age and gender distributions per class are also balanced for the Apkinson corpus. The speech tasks include the same six DDK exercises from the multimodal corpus, the reading of the 10 sentences, and a monologue based on the description of the cookie theft picture from the Boston diagnostic aphasia examination. The movement signals contain 7 tasks captured with the inertial sensors of the smartphone: (1) posture, where the patient stands up straight for 30 seconds, (2) circles, where the patient makes circles with the extended arm, (3) pronation/supination, where the patient stretches out the arm with the palm facing down and then turns the palm up and down several times, (4) finger to nose, where the patient extends the arm, touches his/her nose, and extends the arm again, several times, (5) postural tremor, where the patient extends the arm and holds the smartphone in this position for at least 10 seconds, (6) 4x10, where the patient walks a short path four times, and (7) Free Gait, where the patient walks normally for two minutes. For the walking exercises we asked the patients to put the smartphone in their pockets. For the hand movement exercises the patients held the smartphone in their hands.

Speech modeling
The proposed model to process speech signals is based on CNNs using Mel-spectrograms as input. We computed the Mel spectrum for windows of 32 ms length with a time-shift of 4 ms. Each Mel spectrum is computed with a frequency resolution of 512 points and 64 Mel filters. We then stack together 126 of these Mel spectra to form a Mel-spectrogram of 500 ms length, which is used as input for our proposed CNN. These parameters yield a time-frequency representation of 126 time steps and 64 frequency bins. The spectrograms are modeled with a ResNet18 architecture, which has three residual blocks and 18 convolutional layers (see Figure 1). The skip connections help to control the vanishing gradient problem in deeper models. Dropout layers were considered to regularize the output of the residual blocks. The final decision is made by a fully connected layer with a Softmax activation function.
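The framing arithmetic above can be checked with a short sketch. It is not the authors' implementation: it assumes the 16 kHz sampling rate reported for the multimodal corpus, assumes centered (padded) framing so that a 500 ms segment produces exactly 126 frames, and assumes the common HTK-style Mel formula with filters spanning 0 Hz to the Nyquist frequency. All function names are hypothetical.

```python
import math

SAMPLE_RATE = 16_000  # Hz, as reported for the multimodal corpus recordings
WINDOW_MS = 32        # analysis window length
SHIFT_MS = 4          # time-shift between consecutive windows
SEGMENT_MS = 500      # length of one Mel-spectrogram fed to the CNN
N_MELS = 64           # number of Mel filters


def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a number of samples."""
    return sample_rate * ms // 1000


def n_frames_centered(n_samples, hop):
    """Frame count when the signal is padded so frames are centered every `hop` samples (assumption)."""
    return 1 + n_samples // hop


def hz_to_mel(hz):
    """HTK-style Hz-to-Mel conversion (assumed; the source does not state the formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels=N_MELS, f_min=0.0, f_max=SAMPLE_RATE / 2):
    """Center frequencies (Hz) of n_mels filters equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


window = ms_to_samples(WINDOW_MS)         # 512 samples, matching the 512-point resolution
hop = ms_to_samples(SHIFT_MS)             # 64 samples
segment = ms_to_samples(SEGMENT_MS)       # 8000 samples
frames = n_frames_centered(segment, hop)  # 126 time steps
centers = mel_center_frequencies()        # 64 frequency bins
print(window, hop, frames, len(centers))
```

Under these assumptions each network input is a 64 x 126 matrix (Mel bins by time steps). Without centered padding, 126 frames would instead span 32 + 125 x 4 = 532 ms of signal.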

Gait & Movement modeling
We propose a deep learning model based on 1D-convolutions to process the raw gait signals. Figure 2 illustrates the proposed architecture to model the gait signals of the patients. The input corresponds to 3-second frames of the gait signals. This duration was chosen to guarantee at least 3 periods of the gait signals. For the multimodal corpus, the input is formed with 12 channels corresponding to the 3D accelerometer and 3D gyroscope attached to the left and right foot. The input for the Apkinson data includes only three channels from the 3D accelerometer of the smartphone. The input then passes through a set of two 1D-convolutional layers, which learn a filter-bank. The filtered signals then pass through a stack of two bidirectional gated recurrent unit (GRU) layers to model the temporal structure of the sequences. The last part of the network is an attention mechanism, which assigns more weight to specific parts of the gait sequence, such as pauses, the swing phase, the stance phase, or the beginning/stopping of the gait task.
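The role of the attention mechanism can be illustrated with a minimal sketch of attention pooling over a sequence of recurrent outputs. The source does not specify the scoring function, so this assumes one simple variant: a single scoring vector (a stand-in for learned parameters) produces one score per time step, a softmax turns the scores into weights, and the weighted sum of the hidden states is the sequence summary. The names `attention_pool`, `score_vector`, and the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, score_vector):
    """Summarize T hidden vectors of size D into one vector of size D.

    hidden_states: list of T lists (e.g. GRU outputs, one per time step).
    score_vector:  list of D values standing in for learned attention parameters.
    Returns (context, weights), where the weights sum to 1 over the T time steps.
    """
    # One scalar score per time step: dot product with the scoring vector.
    scores = [sum(h * w for h, w in zip(state, score_vector)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of the hidden states along the time axis.
    context = [
        sum(weight * state[d] for weight, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights


# Toy sequence with 4 time steps and 2 features (made-up values).
hidden = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1], [0.0, 0.3]]
context, weights = attention_pool(hidden, score_vector=[1.0, 1.0])
print(weights.index(max(weights)))  # time step that receives the largest weight
```

In this toy case the second time step has by far the largest score, so it dominates the summary vector. In the trained network the same mechanism would let segments such as pauses or gait initiation contribute more to the decision than the rest of the 3-second frame (300 samples per channel at the 100 Hz rate of the multimodal corpus).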

EXPERIMENTS AND RESULTS
Different experiments are performed to classify PD patients vs. HC subjects and to evaluate the disease severity of the participants. All models are validated with a 10-fold stratified cross-validation strategy. The first experiment corresponds to the classification of PD patients vs. HC subjects using speech signals from the multimodal and Apkinson corpora. The results are shown in Table 1. The accuracy for the multimodal corpus ranges from 88.8% to 92.4%, similar to the one obtained for the Apkinson data, which ranges from 86.7% to 92.2% depending on the speech task. These results confirm previous reports that data collected with smartphones have enough quality to classify speech signals from PD patients [24,25,26]. This study is the first one to confirm that similar results are obtained both with high quality and smartphone data using a full deep learning approach. The results reported here also consider larger amounts of data than the studies previously reported.
The second experiment corresponds to the classification of PD patients vs. HC subjects using the gait & movement signals. The results are shown in Table 2. The accuracy for the multimodal corpus ranges from 90.6% to 98.7%. The highest accuracy is observed in the Stop & Go task, which is the one where the patients have to perform the most start/stop movements of the lower limbs, causing freezing of gait (FoG) episodes in the patients that are modeled with our proposed approach. The results observed for the Apkinson corpus indicate that gait exercises like 4x10 and Free Gait produce the highest accuracies, and the results are similar to the ones obtained with the high quality inertial sensors used in the multimodal corpus. Hand movement tasks like finger to nose and circles produce moderate accuracies. Conversely, tasks such as postural tremor, posture, or pronation/supination are not classified accurately with the proposed model. These particular exercises have very low dynamics compared with the walking tests. The lack of accuracy for these tasks may be explained by the fact that such small temporal variability is not properly captured by the smartphone sensors. Unfortunately, we do not have data collected with the high-quality sensors for these tasks to validate these results. Other methods can be proposed to model the information produced by these types of tasks.
For the third experiment we grouped the subjects from the multimodal corpus into three classes according to their dysarthria severity based on the m-FDA scale [22]. Unfortunately, we do not have labels of the m-FDA score for the subjects in the Apkinson data. The number of subjects per class was determined to guarantee balanced groups. The results are shown in Table 3. The highest accuracy was observed in the monologue task (55.7%). The confusion matrix for the best result is shown in Figure 3a. The class with the highest accuracy corresponds to the patients with intermediate dysarthria level, followed by patients in mild and severe states, respectively.
Finally, we classify the patients into different groups according to their motor severity based on the MDS-UPDRS-III. The patients were grouped into three classes according to their MDS-UPDRS-III score, using the 33rd and 66th percentiles of our data as the borders between the three groups. The subjects in each group were labeled as patients in mild, intermediate, and severe states. The models were then trained to classify these three classes. The results are shown in Table 4. For the multimodal corpus, the highest accuracies are obtained with the TUG and with the Stop & Go tasks, similar to the results observed in the bi-class problem in Table 2. These results confirm the importance of such exercises for the assessment of the gait impairments of PD patients. The confusion matrix for the best result (TUG test) is shown in Figure 3b. The class with the highest accuracy corresponds to the patients in severe state, followed by patients in mild and intermediate states, respectively. Note also that the misclassified patients from the mild and severe classes are mainly misclassified as patients in the intermediate state of the disease rather than in the other extreme class. Regarding the Apkinson corpus, the highest results are again obtained with the gait exercises (4x10 and Free Gait). In this case there are differences of up to 12% between the results obtained in the multimodal and Apkinson corpora. We believe that these differences are due to the reduced size of the Apkinson data available to train the models for this particular and more difficult problem. Additional data using the Apkinson app should be collected and labeled to improve the results.
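The severity grouping and the stratified validation split described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: percentiles are computed with linear interpolation, scores equal to a border fall into the lower class, folds are assigned round-robin within each class after shuffling, and the scores are randomly generated placeholders rather than real MDS-UPDRS-III values. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1.0 - frac) + ordered[upper] * frac


def severity_labels(scores, low_q=33, high_q=66):
    """Map each score to 0 (mild), 1 (intermediate), or 2 (severe)."""
    low_cut = percentile(scores, low_q)
    high_cut = percentile(scores, high_q)
    # Assumption: a score exactly on a border belongs to the lower class.
    return [0 if s <= low_cut else 1 if s <= high_cut else 2 for s in scores]


def stratified_folds(labels, n_folds=10, seed=0):
    """Return a fold index per sample so that each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Round-robin assignment: class counts per fold differ by at most one.
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of


# Placeholder scores for 94 labeled patients (random values, not real data).
rng = random.Random(1)
scores = [rng.randint(5, 90) for _ in range(94)]
labels = severity_labels(scores)
folds = stratified_folds(labels, n_folds=10)
print(Counter(labels))
```

Each fold index can then be used in turn as the test partition while the remaining nine folds are used for training.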

CONCLUSION
The present study proposes the use of deep learning methods to classify PD patients and HC subjects, and to evaluate the disease severity of the patients, using information from speech and gait signals. We evaluate the performance of the proposed approach on signals collected with smartphone sensors. The results show that it is possible to classify PD patients and HC subjects with accuracies of up to 92.4% using speech signals and of up to 98.7% using gait signals. In addition, the results indicate that there is no visible difference in the accuracies observed when considering high quality vs. smartphone data. The disease severity of the patients is estimated with accuracies of up to 55.7% for the speech impairments, and of up to 64.9% for the global motor deficits. Additional data from smartphones should be collected and labeled to improve the results of the disease severity assessment. The next step will be to evaluate the proposed methods in the individual monitoring of the symptoms of the patients. In addition, we are currently running experiments that combine speech and movement, and the results look promising. We hope to include those results in future studies.

ACKNOWLEDGMENTS
This project received funding from the EU Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Grant Agreement No. 766287. This study was inspired by our work in the 2016 Jelinek Memorial Summer Workshop (JSALT), which was supported by JHU. Thanks also to CODI from Universidad de Antioquia grant No. PRG2017-15530.