Cough-based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information

The aim of this contribution is to automatically detect COVID-19 patients by analysing the acoustic information embedded in coughs. COVID-19 affects the respiratory system, and, consequently, respiratory-related signals have the potential to contain salient information for the task at hand. We focus on analysing the spectrogram representations of cough samples to investigate whether COVID-19 alters the frequency content of these signals. Furthermore, this work also assesses the impact of gender on the automatic detection of COVID-19. To extract deep-learnt representations of the spectrograms, we compare the performance of a cough-specific and a pre-trained ResNet18 Convolutional Neural Network (CNN). Additionally, our approach explores the use of contextual attention, so the model can learn to highlight the most relevant deep-learnt features extracted by the CNN. We conduct our experiments on the dataset released for the Cough Sound Track of the DICOVA 2021 Challenge. The best performance on the test set is obtained using the pre-trained ResNet18 CNN with contextual attention, which scored an Area Under the Curve (AUC) of 70.91% at 80% sensitivity.


Introduction
The outbreak of the Coronavirus Disease 2019 (COVID-19) has dramatically stressed health systems worldwide. At the time of writing, the World Health Organization (WHO) reports more than 175.3 M confirmed cases, and more than 3.7 M confirmed deaths of COVID-19 across the globe. Despite the availability of vaccines, massive population screenings will still be needed to control the spread of this disease and its strains. Current medical diagnostic tools are time-consuming and burden public expenditure. Thus, there is an opportunity to develop new digital diagnostic tools to improve the monitoring and early detection of COVID-19 at large scale in a cost-effective manner.
One of the core elements for effective digital health solutions is the use of Artificial Intelligence (AI). AI-based systems have been successfully employed to detect coughs or sneezes [1], or to analyse breath signals [2], among others. Furthermore, AI has also been used in the field of mental health, providing solutions to recognise mental illnesses, such as depression [3,4,5] or Post-Traumatic Stress Disorder (PTSD) [6]. The current context of the pandemic challenges researchers to focus on the development of automatic COVID-19 detection tools.
The symptomatology of COVID-19 presents affectations in the respiratory system. In this direction, the research community has already started investigating the use of AI techniques to analyse lung-based information through chest X-ray images [7,8,9,10,11] or CT scans [12,13,14]. Moreover, related works in the literature explored the use of respiratory-related body signals under the assumption that the acoustics of these signals have a high potential to contain salient information to diagnose COVID-19 patients. The signals considered include breaths [15,16], coughs [17,18], and even speech [19,20].
This paper presents our contribution to the Cough Sound Track of the Diagnosing COVID-19 using Acoustics (DICOVA) 2021 Challenge [21]. We opt for analysing the spectrogram representations of the cough signals with the aim to i) investigate whether COVID-19 symptomatology alters the frequency content of coughs, and ii) assess the impact of gender in the automatic detection of COVID-19 patients. Our approach relies on Convolutional Neural Networks (CNNs) to extract salient information from the spectrograms, combined with Fully Connected (FC) layers responsible for the classification of the embedded features learned. Our approach also explores the use of contextual attention, so the network learns to highlight the most relevant embedded features for the task at hand.
The rest of the paper is laid out as follows. Section 2 describes the dataset analysed, while Section 3 details the methodology followed. Section 4 compiles and analyses the results obtained from the experiments performed, and Section 5 concludes the paper and suggests potential future work directions.

Dataset
The dataset used in this work was released for the Cough Sound Track of the DICOVA Challenge 2021 [21]. This dataset consists of cough sounds recorded from COVID-19 positive and non-COVID-19 (healthy) patients, as well as their associated metadata, i.e., COVID-19 status (positive or negative), gender, and nationality. No information about symptomatic/asymptomatic COVID-19 positive patients is provided in the dataset. The cough recordings are split into a training and a test partition. The former contains the ground truth COVID-19 status, while the latter is blind to the participants. To assess the performance of the models, the Challenge organisers require following a 5-fold cross-validation scheme, and distribute the list of recordings that belong to each fold.
The training partition is composed of 1 040 audio recordings of different durations, ranging from ca. 1 sec up to 15 sec, with an average duration of 4.7 sec. Male and female patients recorded 791 and 249 samples, respectively. The test partition contains a total of 233 audio recordings: 171 and 62 from male and female patients, respectively.

Figure 1: Diagram illustrating the system implemented, which receives a cough sample as input, and outputs the probability that the input sample corresponds to a COVID-19 or a healthy patient. The feature extraction of the segmented spectrograms is performed with a convolutional neural network. The most relevant embedded features are highlighted using a contextual attention mechanism, before the final classification using two stacked fully connected layers.

The provided audio files are sampled at 44.1 kHz. Our preliminary spectral analysis of a subset of the recordings revealed that a substantial amount of them does not have any frequency content above 8 kHz. We hypothesise that a potential reason for this is the use of low-quality equipment by patients when recording their cough samples, e.g., mobile devices. Therefore, we resample all audio files to a common sampling rate of 8 kHz to account for the diversity of devices used for recording. Besides, the lack of frequency content above 8 kHz results in a dark patch in the spectrogram representations of the corresponding samples, which introduces noise into the training data.

Methodology
This section describes the methodology, illustrated in Figure 1. Section 3.1 details the preprocessing applied to the cough samples, Section 3.2 introduces the models implemented, and Section 3.3 summarises the parameters used to train the networks.

Data Preparation
This section details the data preparation stage of our approach, which has several steps: silence removal, feature extraction, data patch generation, and data augmentation.

Sound Activity Detection
Each audio sample in the dataset contains a sequence of coughs. A short amount of silence separates consecutive coughs within each sequence. We consider these silent regions to be irrelevant in the detection of COVID-19, and, therefore, we use a Sound Activity Detector (SAD) to filter them out. After the resampling step, the audio files are passed through a SAD based on the Root-Mean-Square (RMS) value of the audio samples in the time domain [22,23,24]. We compute the RMS using the librosa Python library [25] and a frame length of 64 msec. We use min-max normalisation to scale each audio file's RMS, and we discard all frames whose normalised RMS falls below an empirically set threshold of 0.1. After the SAD step, we concatenate all frames above the threshold, and save the result as a new audio file for further processing. As an additional experiment, we compared the RMS-based SAD to a SAD based on spectral flux [26,27], which detects abrupt changes in the spectral domain. Although a cough is an example of such a change, the preliminary exploration of the results using both methods showed that the RMS-based SAD worked better in this context. Note that to assess the effectiveness of silence removal in the detection of COVID-19 patients, our experiments use both the original and the cough-only audio files. Details about the models trained, and the results obtained are given in Section 3.2 and Section 4, respectively.
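The RMS-based SAD described above can be sketched in plain NumPy. This is a minimal illustration, not the paper's exact implementation (which relies on librosa's RMS feature); the function name and the toy signal are our own:

```python
import numpy as np

FRAME_LEN = 512          # 64 msec at the 8 kHz working sampling rate
RMS_THRESHOLD = 0.1      # empirical threshold from the text

def rms_sad(audio, frame_len=FRAME_LEN, threshold=RMS_THRESHOLD):
    """Keep only the frames whose min-max-normalised RMS exceeds the threshold."""
    # Split the signal into non-overlapping frames (zero-pad the tail).
    n_frames = int(np.ceil(len(audio) / frame_len))
    padded = np.zeros(n_frames * frame_len)
    padded[:len(audio)] = audio
    frames = padded.reshape(n_frames, frame_len)

    # Per-frame RMS, then min-max normalisation to [0, 1].
    rms = np.sqrt((frames ** 2).mean(axis=1))
    rms = (rms - rms.min()) / (rms.max() - rms.min() + 1e-12)

    # Concatenate the frames above the threshold into a cough-only signal.
    return frames[rms > threshold].reshape(-1)

# Toy example: 1 sec of silence followed by 1 sec of a loud burst.
signal = np.concatenate([np.zeros(8000), 0.5 * np.ones(8000)])
kept = rms_sad(signal)  # the silent frames are dropped
```

The output is shorter than the input because the purely silent frames are removed; only the frames that overlap the loud burst survive.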

Feature Extraction and Patch Generation
Our approach uses the spectrogram as the input representation of the cough samples. We use the Short-Time Fourier Transform (STFT) function from librosa to calculate the spectrogram of each cough sample in the dataset. We use a window size of 1 024 samples (128 msec), and a hop length of 128 samples (16 msec). With this configuration, we extract the spectrograms using different parameters to compare their impact on the final results: we compare the spectrograms with a linear or logarithmic frequency scale, and two different colour maps, namely, viridis and magma. The colour map parameter is especially relevant, because the spectrograms are exported as images of 256 × 256 pixels for further use. The preliminary experiments conducted to assess the impact of these parameters, not reported in this work, were not conclusive enough. Nonetheless, analysing the trends in the results obtained, we decided to focus our investigation on the spectrogram representations of the cough samples using the logarithmic frequency scale, and magma as the colour map. To deal with the different durations of the samples in the dataset, we fix the length of the cough samples to be fed into the models. We decided to model the cough samples using acoustic frames of 1 sec in length. Hence, the last step of the data preparation stage is the segmentation of all spectrograms into 1 sec length patches with a 50 % overlap. With this strategy, several patches from a single cough sample are used for training the models.

Data Augmentation
To address the imbalance between positive and negative examples (cf. Section 2), which carries over to the number of spectrogram patches generated from the cough samples, we use data augmentation. Specifically, we increase the number of positive examples via replication, i.e., including copies of the positive spectrograms to balance the dataset. We considered other forms of augmentation, such as filtering or additive noise. However, since it is not yet clear which kind of information from the audio is relevant for the task at hand, we decided not to alter the acoustic content in any way. Although replication introduces redundancy in the training set, we believe it is useful when the number of positive and negative examples differs significantly.
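The replication strategy amounts to simple random oversampling of the minority class. A minimal sketch, assuming the positive class is the minority (the function name is our own):

```python
import random

def balance_by_replication(positives, negatives, seed=0):
    """Replicate positive examples at random until both classes are balanced."""
    rng = random.Random(seed)
    # Draw (with replacement) as many extra positives as needed.
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    return positives + extra, negatives

pos_patches = ["p1", "p2"]
neg_patches = ["n1", "n2", "n3", "n4", "n5"]
balanced_pos, balanced_neg = balance_by_replication(pos_patches, neg_patches)
```

Because only copies of existing positives are added, the acoustic content of the training data is left untouched, as intended.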

Models Description
This section presents the neural networks used to model the cough samples to detect COVID-19 patients. While Section 3.2.1 describes the architecture of the networks implemented, Section 3.2.2 details the procedure for our networks to consider gender information.

Network Architectures
The networks trained to detect COVID-19 from cough samples are composed of a first block, extracting embedded representations from the input spectrograms, and a second block, classifying the embedded features as belonging to healthy or COVID-19 patients. For the latter, we employ two FC layers with 128 and 2 output neurons, respectively. While the first layer uses a Rectified Linear Unit (ReLU) as the activation function, the second one uses Softmax, so the network outputs can be interpreted as probability scores. As our networks' inputs are spectrograms, the extraction of the embedded representations is implemented using CNNs. Specifically, we compare the performance of a cough-specific CNN trained from scratch with the performance of a pre-trained ResNet18 CNN [28]. The cough-specific CNN is implemented with two convolutional blocks with 32 and 64 channels, respectively, a square kernel of 3 × 3, and a stride of 1. Both blocks implement batch normalisation, and use ReLU as the activation function. While the first block includes a 2 × 2 max-pooling, the second one uses adaptive average pooling, so the learnt feature map has a dimension of 2 × 2.
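The cough-specific architecture described above can be sketched in PyTorch. This is a sketch under stated assumptions, not the authors' code: the 3 input channels (the spectrograms are exported as colour images), the `padding=1` setting, and the ordering of layers within each block are our assumptions:

```python
import torch
import torch.nn as nn

class CoughCNN(nn.Module):
    """Sketch of the cough-specific CNN: two conv blocks + two FC layers."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 32 channels, 3x3 kernel, stride 1, batch norm, ReLU.
            nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 2x2 max-pooling
            # Block 2: 64 channels, same kernel/stride, batch norm, ReLU.
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((2, 2)),    # 2x2 learnt feature map
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 2 * 2, 128),      # FC layer with 128 neurons + ReLU
            nn.ReLU(),
            nn.Linear(128, n_classes),       # FC layer with 2 output neurons
            nn.Softmax(dim=1),               # outputs interpretable as probabilities
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One 256 x 256 colour spectrogram image as input.
model = CoughCNN()
probs = model(torch.rand(1, 3, 256, 256))  # shape (1, 2), rows sum to 1
```

In the pre-trained ResNet18 variant, `self.features` would simply be replaced by the ResNet18 backbone, with the same FC classifier on top.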
To highlight the salient information from the embedded representations learnt, we include a contextual attention mechanism (adapted from [29] and [4]) between the two blocks of the network. Representing the deep features learnt as h_t, the contextual attention mechanism is mathematically defined as follows:

u_t = tanh(W h_t + b),
α_t = exp(u_t^T u_c) / Σ_{t'} exp(u_{t'}^T u_c),
h̃ = Σ_t α_t h_t,

where W, b, and u_c are parameters to be learnt by the network. The parameter u_c can be interpreted as the context vector. The attention-based representation obtained, h̃, is then fed into the FC layers for classification.
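The contextual attention mechanism described above maps a sequence of deep features to a single attention-weighted vector. A minimal PyTorch sketch (module and parameter names are ours; the feature dimension of 256 is purely illustrative):

```python
import torch
import torch.nn as nn

class ContextualAttention(nn.Module):
    """Contextual attention over a sequence of deep features h_t."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                # learns W and b
        self.context = nn.Parameter(torch.randn(dim))  # context vector u_c

    def forward(self, h):                      # h: (batch, steps, dim)
        u = torch.tanh(self.proj(h))           # u_t = tanh(W h_t + b)
        scores = u @ self.context              # u_t^T u_c, shape (batch, steps)
        alpha = torch.softmax(scores, dim=1)   # attention weights alpha_t
        return (alpha.unsqueeze(-1) * h).sum(dim=1)  # weighted sum -> h~

att = ContextualAttention(dim=256)
pooled = att(torch.rand(4, 10, 256))  # (batch 4, 10 steps) -> (4, 256)
```

The resulting vector h̃ replaces plain pooling as the input to the FC classifier, letting the network emphasise the most relevant embedded features.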

Gender Awareness
Assessing the impact of gender in the automatic detection of COVID-19 patients is also one of this work's goals. To address this, we explore three different network configurations. The first one does not consider any gender information, and is used as a baseline for our experiments. The second one, referred to as gender-based models in our experiments, includes an encoded representation of the patients' gender, which is concatenated with the deep-learnt features extracted. Both are fed into the FC layers of the network. The third and last configuration, referred to as gender-specific models in our experiments, trains gender-specific models, so female and male coughs are analysed with models trained using samples from patients of the same gender.

Networks Training
All models are trained to minimise the Categorical Cross-Entropy Loss, using Adam as the optimiser with a fixed learning rate of 1e-3. Network parameters are updated in batches of 32 samples, and trained for a maximum of 100 epochs. We implement an early stopping mechanism to stop training when the validation loss does not improve for ten consecutive epochs. To assess the models, we follow a 5-fold cross-validation approach, as defined by the Challenge organisers. Due to early stopping, each fold is trained for a different number of epochs. Therefore, when training on all the training material, and to prevent overfitting, the number of training epochs is set to the median of the number of epochs trained in each fold.
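The early stopping rule and the median-of-folds epoch selection can be sketched as follows. The class name and the fold epoch counts are hypothetical; only the patience of ten epochs and the median rule come from the text:

```python
from statistics import median

class EarlyStopper:
    """Stop training when validation loss fails to improve for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Reset the counter on improvement, otherwise accumulate bad epochs.
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

# Epochs at which each of the 5 folds early-stopped (hypothetical values);
# the final model trained on all data runs for the median of these.
fold_epochs = [23, 31, 18, 40, 27]
final_epochs = median(fold_epochs)  # 27
```

Using the median rather than the maximum keeps the final training length robust to a single fold that stopped unusually late.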

Experimental Results
Our models estimate the probability of the input cough to correspond to a COVID-19 patient. Using these probabilities and a set of thresholds between 0 and 1, we can compute the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the evolution of the True Positive Rate (TPR) against the False Positive Rate (FPR). The TPR, also referred to as sensitivity, corresponds to the percentage of positive examples correctly identified, i.e., the True Positives (TP). The FPR refers to the percentage of negative examples identified as positive, i.e., the False Positives (FP). Using the ROC curve, we quantify the models' performance using the Area Under the Curve (AUC) as our primary evaluation metric. We fix the model sensitivity at 80 %, and compute the model specificity as an additional measure of performance.

As described in Section 3.1.2, several fixed-length spectrograms can be extracted from a single cough sample. Thus, at inference time, several probability scores can be predicted for a single sample. To obtain a single score, we compute the probability of a specific sample to belong to a COVID-19 patient as the median of the probabilities inferred from the corresponding spectrograms. The results obtained when assessing the models trained using a cough-specific CNN without and with contextual attention are summarised in Tables 1 and 2, respectively. The results obtained when assessing the models trained using the pre-trained ResNet18 CNN without and with contextual attention are compiled in Tables 3 and 4, respectively.
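The per-sample score aggregation and the operating-point metrics can be sketched in a few lines of plain Python. The helper names and the toy scores are ours; only the median rule and the TPR/FPR definitions come from the text:

```python
from statistics import median

def sample_probability(patch_probs):
    """Median of the per-patch probabilities predicted for one recording."""
    return median(patch_probs)

def tpr_fpr(labels, scores, threshold):
    """True and False Positive Rates at a given decision threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))      # true positives
    fp = sum(p and not y for p, y in zip(preds, labels))  # false positives
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Hypothetical patch-level scores for three recordings, aggregated per sample.
scores = [sample_probability(p) for p in [[0.9, 0.8], [0.4, 0.6], [0.1, 0.3]]]
labels = [1, 1, 0]  # ground-truth COVID-19 status
tpr, fpr = tpr_fpr(labels, scores, threshold=0.5)
```

Sweeping the threshold over [0, 1] and plotting TPR against FPR yields the ROC curve from which the AUC is computed.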
One of our experiments' main insights is that models that incorporate gender information outperform the baseline model in most cases. In this task, the gender of the patient is especially relevant: the vocal apparatus of females and males has a different shape and size, which results in significant differences both in the timbre and frequency range of the respiratory-related signals. We obtained the best performance with the gender-based model in three of the four scenarios investigated. Besides, although in Table 4 the gender-specific model achieved the highest AUC on the test partition, the corresponding AUC for the gender-based model is only 1 % lower, suggesting an equivalent performance. Further gender-focused evaluations using the model with the best performance on the test partition reported an AUC of 67.98 % and 50.00 % on the validation partition for male and female patients, respectively, highlighting the relevance of gender in this task. Hence, intuitively, gender-specific models should work better, but these cannot be fairly studied because of the imbalance of the data in terms of gender.
When we compare the results between cough-only and original audio files, we observe a clear difference: interestingly, the best performances using cough-only audio files on the test set were obtained with the cough-specific CNN (cf. Tables 1 and 2), while the original audio files scored the highest AUC using the pre-trained ResNet18 CNN (cf. Tables 3 and 4). One potential reason behind these differences is that ResNet18 is a network pre-trained for image classification, and not directly related to acoustics. In general, images can be quite heterogeneous.

Conclusions
This work presented our contribution to the Cough Sound Track of the DICOVA Challenge 2021, which addressed the automatic detection of COVID-19 patients from cough samples. Emphasising the impact of gender, our approach focused on the extraction of deep features from the spectrogram representations of coughs using CNNs in combination with a contextual attention mechanism. Specifically, we compared the performance of a cough-specific CNN, and a pre-trained ResNet18 CNN. A gender-specific pre-trained ResNet18 CNN with contextual attention scored the highest performance on the test set, with an AUC of 70.91 %. Globally, the obtained results support the use of gender-based models, highlighting the impact of gender in the detection of COVID-19 from coughs. The best cough-specific CNNs exploiting cough-only audio files achieved an AUC of 59.04 % and 61.62 % without and with contextual attention, respectively. The best pre-trained ResNet18 CNNs exploiting the original audio files obtained an AUC of 68.95 % and 69.89 % without and with contextual attention, respectively.
As future work, other state-of-the-art pre-trained CNNs in the computer vision or computer audition domains could be investigated to extract deep features from the spectrograms. Further research could also explore deeper cough-specific CNNs to extract more relevant deep features. Regardless of the technology, the medical research in the symptomatology of COVID-19 will provide valuable insights to develop more effective systems.

Acknowledgements
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No. 826506 (sustAGE) and No. 770376 (TROMPA). Further funding has been received from the FI Predoctoral Grant 2018FI-B01015 from AGAUR, Generalitat de Catalunya.