A two-step explainable approach for COVID-19 computer-aided diagnosis from chest x-ray images

Early screening of patients is critical for a fast response to the spread of COVID-19. Nasopharyngeal swabs have been considered the most viable approach; however, the result is not immediate or, in the case of fast exams, not sufficiently accurate. Chest X-Ray (CXR) imaging for early screening potentially provides a faster and more accurate response; however, diagnosing COVID from CXRs is hard and requires deep learning support, whose decision process is, on the other hand, "black-boxed" and, for this reason, untrustworthy. We propose an explainable two-step diagnostic approach, where we first detect known pathologies (anomalies) in the lungs, on top of which we diagnose the illness. Our approach achieves promising performance in COVID detection, comparable to that of expert human radiologists. All of our experiments have been carried out bearing in mind that, especially for clinical applications, explainability plays a major role in building trust in machine learning algorithms.


INTRODUCTION
Early COVID diagnosis is a key element for proper treatment of patients and prevention of the spread of the disease. Given the high tropism of COVID-19 for respiratory airways and lung epithelium, identification of lung involvement in infected patients can be relevant for treatment and monitoring of the disease. Virus testing is currently considered the only specific method of diagnosis. Nasopharyngeal swabs are easy to perform, affordable, and the current standard in the diagnostic setting; their accuracy in the literature is influenced by the severity of the disease and the time from symptom onset, and is reported to be up to 73.3% [1]. Current position papers from radiological societies (Fleischner Society, SIRM, RSNA) [2,3,4] do not recommend routine use of imaging for COVID-19 diagnosis; however, it has been widely demonstrated that, even at early stages of the disease, chest x-rays (CXR) can show pathological findings.
Fig. 1: Comparison between standard approaches to COVID diagnosis and our two-step approach. [Figure labels: Baseline approach; Two-step approach; Radiological report; COVID classifier; COVID diagnosis.]
* This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825111, DeepHealth Project.
In the last year, many works attempted to tackle this problem, proposing deep learning-based strategies [5,6,7,8,9]. All of the proposed approaches share some elements: i) the images collected during the pandemic need to be augmented with non-COVID cases from publicly available datasets; ii) some standard pre-processing is applied to the images, such as lung segmentation using U-Net [10] or similar models [5], or converting the pixels of the CXR scan to Hounsfield units; iii) the deep learning model is trained to produce the final diagnosis using state-of-the-art approaches for deep neural networks. Despite some very optimistic results, the proposed approaches exhibit significant limitations that deserve further analysis. For example, augmenting COVID datasets with negative cases from publicly available datasets can inject a dangerous bias, where the trained model learns to discriminate between different data sources rather than actual radiological features related to the disease [5]. These unwanted effects are difficult to spot when using a "black box" model such as a deep neural network, with no control over the decision process.
In this work we propose an explainable approach that mimics the radiologists' decision process. To this end, we break the COVID diagnosis problem into two sub-problems. First, we train a model to detect anomalies in the lungs. These anomalies are widely known and, following [11], comprise 14 objective radiological observations which can be found in lungs. Then, on top of these, we train a decision tree model, where the COVID diagnosis is explicit (Fig. 1). Mimicking the radiologist's decision process is more robust to biases and aims at building the trust of physicians and patients in the AI tool, which can be useful for fast COVID diagnosis. Thanks to the collaboration with the radiology units of Città della Salute e della Scienza di Torino (CDSS) and San Luigi Hospital (SLG) in Turin, we collected the COvid Radiographic images DAta-set for AI (CORDA), which currently comprises almost 1000 CXRs and includes both positive and negative COVID cases as well as ground truth on the human radiological reporting.

DATASETS
In this section we introduce the datasets that will be used for our proposed approach. For our purposes we first need to detect some objective radiological findings (we train a model on the CheXpert dataset) and then, on top of those, we train a model to elaborate the COVID diagnosis (using the CORDA dataset).

CheXpert: this is a large dataset comprising about 224k CXRs. It is annotated with 14 different observations on the radiographic image: unlike many other datasets, which focus on disease classification based on clinical diagnosis, the main focus here is "chest radiograph interpretation", where anomalies are detected [12]. The learnable radiological findings are summarized in Fig. 2.
CORDA: this dataset was created for this study by retrospectively selecting chest x-rays performed at a dedicated Radiology Unit in CDSS and at SLG in all patients with fever or respiratory symptoms (cough, shortness of breath, dyspnea) who underwent a nasopharyngeal swab to rule out COVID-19 infection. Patients' average age is 61 years (range 17-97 years old). It contains a total of 898 CXRs and can be split by collecting institution into two similarly sized subgroups: CORDA-CDSS [5], which contains a total of 447 CXRs from 386 patients, with 150 images coming from COVID-negative patients and 297 from positive ones, and CORDA-SLG, which contains the remaining 451 CXRs, with 129 COVID-positive and 322 COVID-negative images. Including data from different hospitals at test time is crucial to double-check the generalization capability of our model. The data collection is still in progress, with five other hospitals in Italy willing to contribute at the time of writing. We plan to make CORDA available for research purposes according to EU regulations as soon as possible.

RADIOLOGICAL REPORT
In this section we describe our proposed method to extract radiological findings from CXRs. For this task, we leverage the large-scale dataset CheXpert, which contains annotations for different kinds of common radiological findings that can be observed in CXR images (such as opacity, pleural effusion, cardiomegaly, etc.). Given the high heterogeneity and the high cardinality of CheXpert, it is well suited to our purposes: once the model is trained on this dataset, there is no need to fine-tune it for the COVID diagnosis, since it already extracts objective radiological findings. CheXpert provides 14 different types of observations for each image in the dataset. For each class, the labels have been generated with NLP techniques from the radiology reports associated with the studies, conforming to the Fleischner Society's recommended glossary [11], and marked as: negative (N), positive (P), uncertain (U) or blank (N/A). Following the relationship among labels illustrated in Fig. 2, as proposed by [12], we can identify 8 top-level pathologies and 6 child ones.

Dealing with uncertainty
In order to extract the radiological findings from CXRs, a deep learning model is trained on the 14 observations. Since multiple findings can be present in the same CXR, the weighted binary cross entropy loss is used to train the model. Typically, weights are used to compensate for class imbalance, giving higher importance to less-represented classes. Within CheXpert, however, we also need to tackle another issue: how to treat the samples with the U label. To address this issue, multiple approaches have been suggested by [12]. The most popular is to ignore all the uncertain samples, excluding them from the training process and considering them as N/A. We propose to include the U samples in the learning process, mapping them to maximum uncertainty (probability 0.5 to be P or N). Then, we balance P and N outcomes for every radiological finding. Table 1 shows a performance comparison between the standard approach as proposed by [12] and our proposal (U-label use), for 5 salient radiological findings, using the same setting as in [12]. We observe an overall improvement in the performance, which is expected given the inclusion of the U-labeled examples. For all our experiments, we will use models trained using the U-labeled samples.
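The treatment of uncertain labels described above can be illustrated with a minimal sketch for a single radiological finding. It is not the authors' implementation: the string encoding of the labels, the function names, and the specific inverse-frequency weighting used to balance P and N are all assumptions made for illustration.

```python
import math

# Assumed encoding: "P" positive, "N" negative, "U" uncertain, None for blank (N/A).
TARGETS = {"P": 1.0, "N": 0.0, "U": 0.5}

def soft_target(label):
    """Map a label to a training target; blank labels return None."""
    return TARGETS.get(label)

def class_weights(labels):
    """Inverse-frequency weights that balance P and N (one possible scheme)."""
    n_pos = sum(1 for l in labels if l == "P")
    n_neg = sum(1 for l in labels if l == "N")
    total = n_pos + n_neg
    # Fall back to a neutral weight if a class is absent from a toy batch.
    w_pos = total / (2 * n_pos) if n_pos else 1.0
    w_neg = total / (2 * n_neg) if n_neg else 1.0
    return w_pos, w_neg

def weighted_bce(probs, labels, eps=1e-7):
    """Mean weighted binary cross entropy for one finding, skipping blanks."""
    w_pos, w_neg = class_weights(labels)
    losses = []
    for p, label in zip(probs, labels):
        t = soft_target(label)
        if t is None:
            continue  # blank (N/A) samples do not contribute
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        losses.append(-(w_pos * t * math.log(p) + w_neg * (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses) if losses else 0.0
```

With a target of 0.5, the loss for a U sample is minimized when the predicted probability is also 0.5, so uncertain samples pull the model toward maximum uncertainty rather than toward either class.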

COVID DIAGNOSIS
The second step of the proposed approach is building the model which actually provides a clinical diagnosis for COVID. We freeze the model obtained from Sec. 3 and use its output as image features to train a new binary classifier on the CORDA dataset. We test two different types of classifiers: a decision tree (Tree) and a neural network-based classifier (FC). The decision tree is trained on the output probabilities of the radiological report model, using the state-of-the-art CART algorithm implementation provided by the Python scikit-learn [13] package. Besides the fully explainable decision tree-based result, we also train a neural network classifier, comprising one hidden layer of size 512 and the output layer. Despite working with the same features as the decision tree, such an approach loses explainability but potentially improves COVID diagnosis performance, as we will see in Sec. 5.
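To give an intuition of what CART does with the finding probabilities, the following pure-Python sketch searches for the single best threshold split using Gini impurity, the default split criterion in scikit-learn's decision tree classifier. It is an illustration only and not the scikit-learn implementation used in this work; the function names and the toy data (two hypothetical findings, six images) are made up for the example.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(X, y):
    """Return (feature_index, threshold, weighted_impurity) of the best single split."""
    best = (None, None, gini(y))
    n = len(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] <= thr]
            right = [y[i] for i in range(n) if X[i][j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, thr, score)
    return best

# Toy, made-up probabilities: columns are two hypothetical findings, rows are images.
X = [[0.125, 0.75], [0.25, 0.25], [0.375, 0.875],
     [0.625, 0.375], [0.75, 0.625], [0.875, 0.125]]
y = [0, 0, 0, 1, 1, 1]  # 1 = COVID-positive (toy labels)
feature, threshold, impurity = best_split(X, y)
```

A full tree applies this search recursively to each resulting subset. Because every internal node is a rule of the form "finding probability <= threshold", the learned model can be read directly as a sequence of checks on radiological findings.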

RESULTS
In this section we compare the generalization capability of COVID diagnosis through a direct deep learning-based approach (baseline) with that of our proposed two-step diagnosis, where we first detect the radiological findings and then discriminate patients affected by COVID using either a decision tree-based diagnosis (Tree) or a deep learning-based classifier on the radiological findings (FC). The performance is tested on a subset of patients not included in the training / validation set. The assessed metrics are: balanced accuracy (BA), sensitivity, specificity and area under the ROC curve (AUC). For all of the methods we adopt a 70%-30% train-test split. For the deep learning-based strategy, SGD is used with a learning rate of 0.01 and a weight decay of $10^{-5}$.
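For reference, the four metrics listed above can be computed as in the following sketch. The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration, not values from our experiments; the code assumes both classes are present.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels, with 1 meaning COVID-positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp

def sensitivity(y_true, y_pred):
    """Fraction of positive cases that are correctly detected."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)

def specificity(y_true, y_pred):
    """Fraction of negative cases that are correctly rejected."""
    _, _, tn, fp = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity, insensitive to class imbalance."""
    return (sensitivity(y_true, y_pred) + specificity(y_true, y_pred)) / 2

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with made-up labels and scores.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold
```

Balanced accuracy is the relevant summary here because the positive and negative classes are not equally represented in either CORDA subgroup.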
All of the experiments were run on NVIDIA Tesla T4 GPUs using PyTorch 1.4. Table 2 compares the standard deep learning-based approach [5] to our two-step diagnosis. Baseline results are obtained by pre-training the model on some of the most used publicly available datasets. We observe that the best performance achieved by the baseline is low, with a BA of 0.67. A key takeaway is that trying to directly diagnose diseases such as COVID-19 from CXRs might currently be infeasible, probably because of the small dataset sizes and strong selection bias in the datasets. We can clearly see how the two-step method outperforms the direct diagnosis: using the same network architecture (ResNet-18 as backbone and a fully-connected classifier on top of it), we obtain a significant increase in all of the assessed metrics. Even better results are achieved by using a DenseNet-121 as backbone and the fully-connected classifier.
Fig. 3 graphically shows the learned decision tree (whose performance is shown in Table 2): this provides a very clear interpretation of the decision process. From the clinical and radiological perspective, these data are consistent with the COVID-19 CXR semiotics that radiologists are used to dealing with. The edema feature, although unspecific, is strictly related to the interstitial involvement that is typical of COVID-19 infections and has been widely reported in the recent literature [14]. Indeed, in recent COVID-19 radiological papers, interstitial involvement has been reported as ground glass opacity appearance [15]. However, this definition is more pertinent to the CT imaging setting than to CXR; the "edema" feature can be compatible, from the radiological perspective, with the interstitial opacity of COVID-19 patients. Furthermore, the non-negligible role of cardiomegaly (or, more generally, enlarged cardiomediastinum) in the decision tree can be interesting from the clinical perspective.
In fact, this can be read as additional evidence that established cardiovascular disease can be a relevant risk factor for developing COVID-19 [16]. Moreover, it may be consistent with the hypothesis of a larger role of primary cardiovascular damage, observed in preliminary autopsy data from COVID-19 patients [17].
Focusing on the deep learning-based approach (FC), we observe a boost in performance, achieving a BA of 0.75. However, this is the result of a trade-off between interpretability and discriminative power. Using Grad-CAM [18] we obtain hints about the areas the model focused on to reach the final diagnostic decision. From Fig. 4 we observe that on COVID-positive images, the model seems to mostly focus on the expected lung areas. Finally, to further test the reliability of our approach, we also applied our strategy to CORDA-SLG (data coming from a different hospital), reaching comparable and encouraging results.

CONCLUSIONS
One of the latest challenges for both the clinical and the AI community has been applying deep learning to diagnose COVID from CXRs. Recent works suggested the possibility of successfully tackling this problem, despite the currently small quantity of publicly available data. In this work we propose a multi-step approach, close to the physicians' diagnostic process, in which the final diagnosis is based upon detected lung pathologies. We performed our experiments on CORDA, a COVID-19 CXR dataset comprising approximately 1000 images. All of our experiments have been carried out bearing in mind that, especially for clinical applications, explainability plays a major role in building trust in machine learning algorithms, although better interpretability can come at the cost of lower prediction accuracy.