Acoustic Feature Extraction with Interpretable Deep Neural Network for Neurodegenerative related Disorder Classification

Speech-based automatic approaches for detecting neurodegenerative disorders (ND) and mild cognitive impairment (MCI) have received more attention recently because they are non-invasive and potentially more sensitive than current pen-and-paper tests. The performance of such systems is highly dependent on the choice of features in the classification pipeline. For acoustic features in particular, arriving at a consensus on the best feature set has proven challenging. This paper explores using a deep neural network to extract features directly from the speech signal as a solution. Compared with hand-crafted features, more information is present in the raw waveform, but the feature extraction process becomes more complex and less interpretable, which is often undesirable in medical domains. Using a SincNet as the first layer allows for some analysis of the learned features. We propose and evaluate Sinc-CLA (with SincNet, Convolutional, Long Short-Term Memory and Attention layers) as a task-driven acoustic feature extractor for classifying MCI, ND and healthy controls (HC). Experiments are carried out on an in-house dataset. Compared with popular hand-crafted feature sets, the learned task-driven features achieve superior classification accuracy. The filters of the SincNet are inspected and acoustic differences between HC, MCI and ND are found.


Introduction
Neurodegenerative disorders (ND) are caused by the slow, progressive loss of neurons in the central nervous system, leading to an irreversible selective loss of brain functions that causes dementia. With an aging society, the number of people living with ND is increasing rapidly. Before being diagnosed with ND, people with early signs of cognitive decline are often diagnosed with Mild Cognitive Impairment (MCI). They exhibit symptoms worse than those expected from normal aging but not severe enough to be diagnosed as dementia [1]. About 10% to 15% of people living with MCI progress to Alzheimer's Disease (the most common type of dementia) each year [2]. Accurate and early detection of MCI and ND is therefore of great importance.
Current clinical practice uses patient history and cognitive screening instruments, plus structural brain imaging to exclude other causes, which is also known as a rule-out approach. The availability of expert neuropsychological testing is variable and subject to long delays, so it is not feasible for wide-scale screening. In research settings and in some specialist centres, some 'rule-in' diagnostic tests are used, but these are expensive, invasive, or both, such as Positron Emission Tomography scanning for amyloid (Amyloid-PET) or analysis of the cerebrospinal fluid for biomarkers.
Even though memory impairment is the main symptom of MCI and ND, language and speech are also affected, even decades before diagnosis [3]. Recently, automatic approaches to analysing a person's speech and language have gained traction. Language-based analysis is mostly carried out on either manual or automatic transcripts [4,5], whereas speech-based analysis is normally based on the acoustic signal [6][7][8][9][10][11]. In both cases, the performance of a typical classification pipeline is highly dependent on the quality of the front-end features. This paper focuses on finding better acoustic features. Conventional hand-crafted acoustic features can be divided into two classes: general features, like MFCC [9], F0 [6], jitter and shimmer [7], and more specifically designed features informed by medical knowledge, like the features proposed in [10]. The general acoustic features contain information about voice quality but cannot describe task-specific symptoms well, which often results in researchers extracting very long lists of features (often in the thousands) while still achieving unsatisfactory performance. On the other hand, the specially designed features require an exact translation of human medical knowledge into mathematical expressions, which can be challenging.
There are very few publicly available datasets for investigating cognitive decline, and most research is carried out on self-collected datasets, which introduces a large variation in accents, background noise and recording devices. As a result, feature sets found to be optimal for one dataset do not necessarily perform stably on other datasets. Task-driven, learned features can be a better choice when the aim is generalisation. Neural networks (NNs) have proven effective as front-end feature extractors in various tasks [12,13] compared with traditional hand-crafted acoustic features. However, most NNs act as a black box, which makes it harder to analyse and interpret the learned representations that could otherwise lead to meaningful insights. In this paper we introduce the SincNet as the first NN layer in order to address this.
The contributions of this paper are as follows: (1) a feature extractor is constructed with a SincNet-fronted NN architecture for generating task-driven acoustic features; the performance on classifying MCI and ND from HCs (healthy controls) is much improved compared with the baseline feature sets. (2) an analysis of the SincNet reveals what information has been learned while training for classification. (3) to the best of our knowledge, this is the first study that explores the critical acoustic information for cognitive decline detection from a perspective of deep learning.
In the remainder of this paper, Section 2 presents the background. Section 3 presents the designed feature extractor. Sections 4 and 5 describe the experimental setup and results, and finally, the conclusions are given in Section 6.

Background
Feature extraction is crucial for the performance of a classification system. Depending on the task and dataset, hand-crafted features might not always be the best choice. For example, the Mel-scale filter bank, designed to mimic auditory and physiological evidence of how humans perceive speech signals [14], is used broadly but is not guaranteed to be the best filter bank for the target task. Compared with hand-crafted features, the raw waveform includes more information. Extracting the target information directly from the raw waveform with NNs has been an active and promising area of research, especially in mainstream speech research fields like speech recognition [15], speaker recognition [16] and emotion recognition [17].
Convolutional neural networks (CNNs), deep neural networks (DNNs), and recurrent neural networks (RNNs), such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks, are three of the most popular NN structures for speech processing applications. They have different advantages. CNNs have demonstrated their ability to extract robust and invariant representations in the face of the typical frequency variations of acoustic recordings by applying local filters and pooling [18]. RNNs are good at capturing the temporal evolution of speech signals and modelling sequence information [19]. In comparison, DNNs are generally used for mapping features from one feature space into a more separable space. In [20], it was found that combining CNNs, LSTMs, and DNNs for speech processing in a unified architecture allowed their complementary natures to be exploited. The attention mechanism has lately been used in different fields with a great deal of success [4,21,22]. The main idea behind the attention mechanism is to apply a higher attention weight to the parts of the input that are more critical for classification.
The first layer is always significant for the performance of a raw-waveform input system, as it deals with the high-dimensional and noisy input [23]. Commonly used CNNs work as a task-specific finite impulse response filter bank followed by a non-linearity [13]. A novel CNN structure named SincNet has been proposed, whose filters are defined by a set of parametrized sinc functions. Fewer parameters therefore need to be trained, which makes it more interpretable and allows it to converge faster [16]. These characteristics make it suitable for the first layer in our system. This paper aims at building a system that can make use of the benefits of different kinds of networks for classifying ND and MCI from HCs. The dataset is a small-scale, self-collected dataset named IVA. It comprises audio recordings of HCs and people living with MCI and ND as they interact with an Intelligent Virtual Agent that asks them memory-probing questions (see Section 4.1 for more details about the data). In the system, a SincNet is applied as the first layer of our network, followed by CNN (C), LSTM (L) and the attention mechanism (A); we refer to this network as Sinc-CLA in the following. Our results show that the network-learned features can be more distinctive and informative than the INTERSPEECH 2010 Paralinguistic Challenge (IS10) feature set [24] and the ComParE 2013 feature set [25]. The structure of the Sinc-CLA system is illustrated in Figure 1.

Task-driven Feature Extraction
In this section, the process of task-driven feature extraction is described in detail. The first functional layer of the model is the SincNet layer, followed by max pooling and layer normalization. The output of the $i$-th filter, $i \in [1, N]$, in the SincNet layer is defined as follows:

$$y_i[n] = x[n] * g[n, f_{i1}, f_{i2}], \qquad g[n, f_{i1}, f_{i2}] = 2 f_{i2}\,\mathrm{sinc}(2\pi f_{i2} n) - 2 f_{i1}\,\mathrm{sinc}(2\pi f_{i1} n) \tag{1}$$

where $x[n]$ is the $n$-th chunk of the signal, $*$ denotes convolution, and $g[n, f_{i1}, f_{i2}]$ represents the $i$-th filter of the filter bank. $f_{i1}$ and $f_{i2}$ are the low and high cut-off frequencies that need to be learned while training. The sinc function is defined as $\mathrm{sinc}(x) = \sin(x)/x$. To reduce the ripples in the passband and the limited attenuation in the stopband that result from truncating the filter, a Hamming window [26] is applied to $g[n, f_{i1}, f_{i2}]$. In Eq. 1, the filters are initialized with the cut-off frequencies of the Mel-scale filter bank, which takes human perception into consideration.
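As an illustration of the windowed sinc filters described above, the following is a minimal sketch in plain Python, not the implementation used in the experiments. The function names, the 30 Hz lower edge of the Mel initialisation and the use of adjacent, non-overlapping Mel bands are assumptions made for the example; only the filter count (80), the filter length (125 samples) and the 16 kHz sampling rate come from the text.

```python
import math

def sinc(x):
    """sinc(x) = sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(x) / x

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_cutoffs(n_filters, sample_rate, f_min=30.0):
    """(low, high) cut-offs in Hz of n_filters adjacent bands equally spaced
    on the Mel scale. f_min = 30 Hz is an assumption of this sketch."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(lo + (hi - lo) * k / n_filters) for k in range(n_filters + 1)]
    return list(zip(edges[:-1], edges[1:]))

def sinc_bandpass(f1, f2, length, sample_rate):
    """Hamming-windowed band-pass kernel with cut-offs f1 < f2 given in Hz."""
    a, b = f1 / sample_rate, f2 / sample_rate  # normalised cut-offs (cycles/sample)
    mid = (length - 1) / 2
    kernel = []
    for k in range(length):
        n = k - mid  # time index centred on the middle tap
        g = 2 * b * sinc(2 * math.pi * b * n) - 2 * a * sinc(2 * math.pi * a * n)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * k / (length - 1))  # Hamming window
        kernel.append(g * w)
    return kernel

def gain(kernel, freq_hz, sample_rate):
    """Magnitude of the kernel's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(h * math.cos(w * k) for k, h in enumerate(kernel))
    im = sum(h * math.sin(w * k) for k, h in enumerate(kernel))
    return math.hypot(re, im)

# Mel-initialised bank with the sizes given in the text: N = 80, L = 125, 16 kHz.
bank = [sinc_bandpass(f1, f2, 125, 16000) for f1, f2 in mel_cutoffs(80, 16000)]
```

Only the two cut-off frequencies of each kernel would be trainable in such a layer, which is why the learned filters remain directly readable as frequency bands.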
The second part of the model is a standard 1-D convolutional layer, followed by a max pooling layer and layer normalization. The output H[n] of the normalization layer is used as the input to the third part, the bidirectional LSTM, which can utilize both the forward and backward information of the input. Then, an attention layer and a dense layer are applied for feature weighting and mapping. The attention layer operates on ht[n], the t-th component of H[n] output by the LSTM. ut[n] can be regarded as the hidden representation of ht[n] obtained through a one-layer MLP. The importance of each hidden representation is measured by the normalized similarity between ut[n] and u. The vector u can be regarded as a high-level representation of the fixed query "what is the important information in the fixed input" [27]. The system is trained by minimizing the loss between the predicted label and the ground-truth label. After training, the complete system can be regarded as the combination of a front-end feature extractor and a back-end feature classifier. To test the representation ability of the feature extractor, as illustrated in Figure 1, we evaluated features extracted from either the dense layer or the attention layer. More details can be found in Section 4.3.
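The attention pooling described in this paragraph can be sketched as follows. This is a hedged illustration, not the trained layer: the tanh non-linearity in the one-layer MLP, the softmax normalisation of the similarities, the weighted sum as the layer output, and the names `attention_pool`, `W`, `b` and `u` are assumptions that follow a common formulation of this kind of attention.

```python
import math

def attention_pool(H, W, b, u):
    """Attention pooling over time.

    H: T x D list of LSTM output vectors h_t
    W: A x D weight matrix and b: length-A bias of the one-layer MLP
    u: length-A query vector
    Returns (attention weights over the T steps, pooled D-dimensional vector).
    """
    scores = []
    for h in H:
        # u_t = tanh(W h_t + b): hidden representation of h_t (tanh is an assumption)
        u_t = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
               for row, bias in zip(W, b)]
        # similarity between u_t and the query u
        scores.append(sum(p * q for p, q in zip(u_t, u)))
    # softmax over time, shifted by the maximum for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # weighted sum of the LSTM outputs
    pooled = [sum(a * h[d] for a, h in zip(alphas, H)) for d in range(len(H[0]))]
    return alphas, pooled
```

With the sizes given later in the paper, D would be 100 (the bidirectional LSTM output) and A would be 30, so the pooled attention feature keeps the 100 dimensions of the LSTM output.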

Dataset
The IVA dataset was collected at the Department of Neurology, University of Sheffield, based at the Royal Hallamshire Hospital, in a real clinical setting during 2016, 2017 and 2018 [10]. A Digital Doctor (or Intelligent Virtual Agent) presented on a laptop asks a series of conversational questions and administers a series of verbal tests. The questions are designed to mimic the neurologist-patient conversation that takes place as part of routine diagnostic assessments. The speech sampling rate is 16 kHz. In our experiment, only the audio recordings from participants diagnosed with ND or MCI and from HCs are used. Further information about the data is given in Table 1 and in [10]. The average duration of the recordings in the IVA dataset is about 9 minutes, which is too long to use directly as the input of the Sinc-CLA feature extractor. A similar problem is described in [8], where the input was segmented using manual information. As we are aiming for a fully automatic system, we instead chose to cut each recording into 2-second chunks. Each chunk is assigned the label of the diagnostic category of its recording.
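The chunking step can be sketched as below. Whether the chunks overlap and what happens to a trailing segment shorter than 2 seconds is not stated, so this sketch assumes non-overlapping chunks and drops the remainder; the function name and signature are hypothetical.

```python
def split_into_chunks(samples, label=None, sample_rate=16000, chunk_seconds=2.0):
    """Cut one recording into fixed-length, non-overlapping chunks.

    Each chunk inherits the diagnostic label of its recording. A trailing
    segment shorter than one chunk is dropped (an assumption of this sketch).
    """
    size = int(round(sample_rate * chunk_seconds))
    n_full = len(samples) // size
    return [(samples[i * size:(i + 1) * size], label) for i in range(n_full)]
```

Under these assumptions, a recording of average length (about 9 minutes, or 540 seconds) yields 270 chunks of 32,000 samples each at 16 kHz.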

Evaluation Setting
To provide a reliable result, 10-fold cross-validation (CV) is used on the relatively small dataset, and the folds are fixed for all the experiments we present. The number of recordings in the three partitions (training, development, and test) of each fold is as balanced as possible in terms of the diagnostic category. In addition, as can be seen from Table 1, some speakers contributed more than one recording, and these were kept in the same partition (speaker independent).
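One simple way to obtain speaker-independent folds that are roughly balanced by diagnostic category is sketched below. The actual fold assignment procedure is not described, so the round-robin strategy, the function name and the input format are assumptions; the sketch also assumes that every speaker has a single diagnostic label and only covers the assignment of recordings to folds, not the further split into training and development sets.

```python
from collections import defaultdict

def speaker_independent_folds(recordings, n_folds=10):
    """Assign recordings to folds so that all recordings of a speaker share a fold.

    recordings: list of (recording_id, speaker_id, label) tuples.
    Speakers are dealt out round-robin, one diagnostic category after the
    other, which keeps the categories roughly balanced across folds.
    """
    by_label = defaultdict(dict)  # label -> speaker -> list of recording ids
    for rec_id, speaker, label in recordings:
        by_label[label].setdefault(speaker, []).append(rec_id)

    folds = [[] for _ in range(n_folds)]
    slot = 0
    for label in sorted(by_label):
        for speaker in sorted(by_label[label]):
            folds[slot % n_folds].extend(by_label[label][speaker])
            slot += 1
    return folds
```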
A typical classification pipeline is used to evaluate the extracted features. The front-end features are either the baseline feature sets or the features learned by Sinc-CLA, followed by the back-end classifier. Logistic Regression (LR) and Support Vector Machine (SVM), the most commonly used classifiers in acoustic-based cognitive decline detection, are adopted. The SVM parameters are set to C = 0.01 with an RBF kernel. For each data fold, the features from the training and development sets (9 folds) are used to train the back-end classifier and the test set is used for evaluation. The presented result is averaged across the 10 test folds. Both chunk-level and recording-level results are evaluated. The recording-level result is calculated by majority voting over the labels the classifier predicts for the chunks of each recording. To verify our system, the classification tasks include HC vs. ND, HC vs. MCI, and HC vs. people living with either ND or MCI.
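Majority voting over per-chunk predictions, which turns chunk labels into one label per recording, can be sketched as follows. The function name and input format are hypothetical, and the tie-breaking rule (the label seen first among the tied ones wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

def recording_level_labels(chunk_predictions):
    """Majority vote over chunk predictions.

    chunk_predictions: iterable of (recording_id, predicted_label) pairs,
    one pair per 2-second chunk.
    Returns a dict mapping each recording_id to its most frequent label.
    """
    votes = {}
    for recording_id, label in chunk_predictions:
        votes.setdefault(recording_id, Counter())[label] += 1
    return {rec: counts.most_common(1)[0][0] for rec, counts in votes.items()}
```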

Model Configuration
In the feature extraction part, the segmented chunks in the training set are fed into the designed feature extractor (Sinc-CLA). The SincNet layer is composed of N = 80 filters of length L = 125 samples. The parameters of the filters in the SincNet layer are initialized with the cut-off frequencies of the Mel-scale filter bank, as introduced in [16]. The standard convolutional layer uses 60 filters of length 5. The max-pooling size for both convolutional layers is 3. The number of units in the bidirectional LSTM is 50. The output of the bidirectional LSTM layer is a 100-dimensional feature, which is the concatenation of the two 50-dimensional unidirectional LSTM outputs. The dimension of the attention matrix is set to 30. The output of the attention layer is a 100-dimensional vector. The dense layer comprises 1024 units. All hidden layers in the model use leaky-ReLU [28] non-linearities. RMSprop [29] is used as the optimizer with a learning rate of 0.01. During training, the mini-batch size is set to 30 and the number of epochs to 40. All the parameters of the network are selected according to the development set, with F-measure as the criterion. After the feature extractor is trained, the 2-second chunks are input into the Sinc-CLA feature extractor. The features output by the attention layer and the dense layer (named 'attention feature' and 'dense feature' in the following) are used for the classification experiments.
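To make the configuration concrete, the sketch below traces the tensor shapes of one 2-second chunk (32,000 samples at 16 kHz) through the layers. Strides and padding are not specified in the text, so the sketch assumes unpadded convolutions with stride 1 and non-overlapping max pooling; the frame counts it produces depend on those assumptions, and the function names are hypothetical.

```python
def conv_out(length, kernel, stride=1):
    """Output length of an unpadded 1-D convolution."""
    return (length - kernel) // stride + 1

def pool_out(length, size):
    """Output length of non-overlapping max pooling (incomplete window dropped)."""
    return length // size

def sinc_cla_shapes(n_samples=32000):
    """Per-layer output shapes for one chunk, as (time frames, channels)."""
    t = pool_out(conv_out(n_samples, 125), 3)  # SincNet: 80 filters of length 125, pool 3
    sincnet = (t, 80)
    t = pool_out(conv_out(t, 5), 3)            # convolution: 60 filters of length 5, pool 3
    conv = (t, 60)
    return {
        "sincnet": sincnet,
        "conv": conv,
        "bilstm": (t, 100),    # two directions x 50 units, one vector per frame
        "attention": (100,),   # weighted sum over time: the 'attention feature'
        "dense": (1024,),      # the 'dense feature'
    }
```

Under these assumptions the SincNet layer outputs 10,625 frames and the bidirectional LSTM sees a sequence of 3,540 frames, which the attention layer then collapses into a single 100-dimensional vector.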

Baseline Features
Research has shown promising results for using features initially proposed for emotion recognition in systems for automatic assessment of cognitive decline [8,30,31]. IS10 and ComParE features, which have achieved outstanding results [30,31], are adopted as the baseline feature sets in our experiment. The features are extracted with the openSMILE [25] toolkit. Compared with frame-level features, statistical suprasegmental features provide better performance on our task. To get the suprasegmental features for each 2-second chunk, the mean, maximum, minimum, median, and standard deviation are calculated across time on the frame-level feature matrix as in [9]. This yields 380 (76 × 5) features based on IS10 and 650 (130 × 5) features based on ComParE.
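The five statistical functionals can be sketched as below, given a frame-level feature matrix that has already been extracted. The function name is hypothetical, and the use of the population standard deviation and the ordering of the output (all means first, then all maxima, and so on) are assumptions of this sketch.

```python
import statistics

# mean, maximum, minimum, median, standard deviation, in the order given in the text
FUNCTIONALS = (statistics.fmean, max, min, statistics.median, statistics.pstdev)

def suprasegmental(frames):
    """Collapse a T x D frame-level feature matrix into 5 * D chunk-level features.

    frames: list of T frame vectors, each of length D.
    Population standard deviation is used here (an assumption of this sketch).
    """
    columns = list(zip(*frames))  # one tuple of T values per frame-level feature
    return [functional(column) for functional in FUNCTIONALS for column in columns]
```

For a chunk with 76 frame-level features this returns 380 values, and for 130 frame-level features it returns 650, matching the feature counts above.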

Filter Analysis
Before describing the classification results, it is interesting to analyse the learned SincNet filters. Figure 2 shows the initialised and the three learned cumulative frequency responses (CFRs) of the SincNet layer. The black line corresponds to the initialised CFR, and the different coloured lines refer to different classification tasks after training. The filter sum is normalized by the highest response. The observations can be summarised as follows:
1. Compared with the initialised CFR, the learned CFRs show more detail. This shows that task-specific information has been learned while training the filters. It may also explain why Mel-scale filter bank based features are less suitable for our specific task.
2. The CFRs of the three tasks show that the frequency responses concentrate on the low frequencies, which is consistent with prior knowledge [6,9]. Although low-frequency information has been taken into consideration in the design of some hand-crafted features, these features cannot achieve results as good as the features learned by our designed feature extractor (shown in Section 5.2).
3. Furthermore, compared with the other two tasks, the CFR in the low-frequency region is higher for the HC vs. ND classifier. This may mean that for more severe symptoms, as seen in the ND cohort, more emphasis should be placed on low frequencies for classification.
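A cumulative frequency response of this kind can be computed from a set of filter kernels as sketched below: the magnitude responses of all filters are summed and the sum is divided by its highest value. The function names and the number of frequency points are hypothetical, and the direct DFT is used only to keep the example dependency-free.

```python
import cmath
import math

def magnitude_response(kernel, n_points=64):
    """Magnitude response of one FIR kernel at n_points frequencies
    evenly spaced from 0 to the Nyquist frequency."""
    response = []
    for p in range(n_points):
        w = math.pi * p / (n_points - 1)  # angular frequency in radians/sample
        response.append(abs(sum(h * cmath.exp(-1j * w * k)
                                for k, h in enumerate(kernel))))
    return response

def cumulative_frequency_response(kernels, n_points=64):
    """Sum of the filters' magnitude responses, normalised by the highest value."""
    total = [0.0] * n_points
    for kernel in kernels:
        for p, magnitude in enumerate(magnitude_response(kernel, n_points)):
            total[p] += magnitude
    peak = max(total)
    return [value / peak for value in total]
```

Applying the same function to the kernels before and after training gives curves that can be compared directly, since both are normalised to a peak of 1.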
The output of the SincNet layer is a matrix H of size frame_num × filter_num. As opposed to a standard CNN, the learned filters in a SincNet are ordered by frequency (from low to high, due to the Mel-scale initialization). The benefit is that analysing the SincNet output can help us better interpret the frequency-related information that may be informative for cognitive decline assessment. To this end, the average filter_num-dimensional (80-dimensional) vector for each recording is calculated by averaging H over time. In Figure 3, only the first 5 of the 80 dimensions of the average vector are plotted, as they are more distinctive. Each row corresponds to the filter response of one recording. As described in [32], an increase in the power of low frequency ranges can be a result of cognitive decline. The high values and main differences are concentrated in the first several filters for the three tasks.

Classification Result
The classification results on the baseline feature sets (IS10 and ComParE) and on the dense and attention features are calculated by averaging across the 10-fold CV. Both the chunk-level and recording-level F-measures are calculated and presented in Table 2 and Table 3 respectively. In Table 2, compared with the IS10 and ComParE feature sets, the classification results of the learned dense and attention features are superior for the three classification tasks we performed. For example, for the HC vs. ND task, the best chunk-level classification performance is 88.39%, achieved by the dense feature classified by LR, compared with 81.34%, achieved by IS10 classified by LR, as the best baseline result. The performance of the dense and attention features does not differ much for either of the two classifiers. Comparing the two tables, the recording-level performance, obtained by majority voting over the chunk-level labels, is better than the chunk-level performance but follows the same trends across features and classifiers.

Conclusion
In this paper, a feature extractor named Sinc-CLA was designed for extracting task-driven features from the raw waveform to classify recordings of people with ND or MCI and healthy controls (HC). Compared with the IS10 and ComParE feature sets, the task-driven features achieved superior performance. Analysing the CFRs of the SincNet layer gave us evidence that low-frequency information is critical for classifying MCI and ND from HC. The interpretability of the learned filters and their outputs makes the result more convincing.

Acknowledgement
This work is supported under the European Union's H2020 Marie Skłodowska-Curie programme TAPAS (Training Network for PAthological Speech processing; Grant Agreement No. 766287).