The Case of Missed Cancers: Applying AI as a Radiologist’s Safety Net

We investigate the potential contribution of an AI system as a safety net application for radiologists in breast cancer screening. As a safety net, the AI alerts on cases suspected to be malignant that the radiologist did not recommend for recall. We analyzed held-out data of 2,638 exams enriched with 90 missed cancers. In screening mammography settings, we show that a system alerting on 11 out of every 1,000 cases could detect up to 10.7% of the radiologists' missed cancers. The AI safety net assisted 3 out of the 5 radiologists in detecting missed cancers without raising any false alerts.


Radiologists' Performance in Screening Digital Mammography
Breast cancer (BC) is the most commonly diagnosed cancer among women worldwide, and the second leading cause of cancer-related deaths. As treatment options improve, early detection may have a larger impact on morbidity and mortality. Presently, digital mammography (DM) is the most common screening method used globally. Women undergo a DM exam every 1-3 years depending on their familial history and national policy. These exams are then interpreted by radiologists based on the Breast Imaging Reporting and Data System (BI-RADS). According to the Breast Cancer Surveillance Consortium (BCSC) benchmark for radiologists in DM screening [1], the average radiologist's sensitivity and specificity are 87% and 89%, respectively. While 97.1% of radiologists fall within the acceptable range of sensitivity ≥75%, only 63.0% meet the acceptable specificity range of 88%-95%. Indeed, analyzing mammograms is a challenging task. Previous works have shown inter-radiologist agreement to be slight to moderate at best [2][3][4]. A second reading of mammograms by an additional radiologist has been shown to increase sensitivity and specificity [5,6]. However, a lack of trained radiologists, budget, and time limitations often make it impractical in the standard screening procedure [7]. AI systems may help close the gap of readily available second readers, but their real-world efficacy is still a matter of debate [8][9][10].

AI Performance in Screening Digital Mammography
Since June 2019, six different papers [11][12][13][14][15][16] have reported results of AI models trained on retrospective screening mammography data for the detection of BC within 12 months. Most studies were based on large-scale data (on the order of 15K to 150K women) and reported impressive results, often in the range of the radiologists' performance or even surpassing it. The radiologists' performance on the held-out data in the above papers (when reported) is shown as full circles in Fig. 1. The reported numbers vary between a sensitivity of 77% (readers in Italy and the Netherlands) and 98.5% (UK consensus reader), and a specificity between 67% (Italian readers) and 97.4% (Israeli readers), which is consistent with the performance of radiologists derived by the BCSC [1]. The performance of the AI models is illustrated by empty circles.
The ability to reach radiologists' levels of specificity and sensitivity indicates that the technology could have assisted radiologists. Unfortunately, it is almost impossible to compare the AI models. Not only does each paper leverage different datasets from different geographies (USA, South Korea, Israel, and multiple countries in western Europe), but different evaluation methods and performance measures were also used for reporting the results. Most importantly, some key elements that could demonstrate the usefulness of these models as part of a screening routine were often missing from the reports: namely, their expected performance in real-life settings, including what composition of cases was considered real-life settings, and the existence of radiologists' false negative (FN) cases (definition, count, and performance). Moreover, only two of the six papers conducted reader studies [13,14]. In this work, we focused on a safety net application, where the aim of the technology is to alert on FN cases within 12 months of the index exam while maintaining a low number of false alerts. Reducing the AI's false alarms is key, as computer-aided diagnosis systems have been shown in the past to generate a large number of false positive findings, slowing the radiologist's work without contributing to their performance [8]. False positive recalls may induce extra costs, unnecessary anxiety, and additional procedures for a healthy population [17,18]. For that purpose, our system worked at a high-specificity operation point, at which it was expected to identify normal cases with high confidence.
However, a high-specificity operation point is not necessarily optimal for the detection of all cancer cases, let alone those that were missed by the first reader.
Here, we analyzed the overall contribution of the AI system in a safety net application to the radiologist's performance: first, on a held-out set based on the original radiologists' performance, and second, in a multi-reader study examining its contribution to individual readers. Both the held-out data and the reader study were enriched with radiologists' FN cases.

Fig. 1. Performance of radiologists and AI systems within a 12-month follow-up period as reported in recent publications. Straight lines correspond to studies that reported the performance of both AI and radiologists on the same cohort. + indicates adjusted results for a screening scenario. Results with specificity below 50% and sensitivity below 75% are not shown.

Methods
This work was approved by the research ethics review board of Assuta Medical Centers (AMC), who also waived the need to obtain a written informed consent. Data were collected, managed and anonymized by Maccabi Health Services (MHS).

AI as a Safety Net Application
The AI system used in this work was trained on the index exam DM images and the detailed clinical history of the women. Its architecture consisted of an ensemble of three deep learning (DL) models at the image level; these models produced a malignancy score and served as feature extractors. Another machine learning model combined the DL models' outputs with the clinical information (including imaging history, gynecological and familial history, medications, diagnoses, and lab results) into a final malignancy score at the breast level. For a study-level decision, the maximum of the two breast-level predictions was taken. For more details see [11].
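As an illustrative sketch of the score fusion described above (function names and the `combiner` stand-in are our own, not the system's actual code):

```python
import numpy as np

def breast_score(image_scores, clinical_features, combiner):
    # Hypothetical fusion: average the DL ensemble's image-level scores,
    # then let a downstream ML combiner merge them with clinical features.
    dl_score = float(np.mean(image_scores))
    return combiner(dl_score, clinical_features)

def study_score(left_breast_score, right_breast_score):
    # Study-level decision: the maximum of the two breast-level scores.
    return max(left_breast_score, right_breast_score)
```

Here `combiner` could be any classifier over the DL outputs and clinical variables; the actual model is described in [11].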
The model's output is a continuous score between 0 and 1. For the system to provide a final binary decision, thresholds were set at specific operation points using a validation set from a previous study [11], without overlap with this work's datasets. One operation point was 87% sensitivity, consistent with the average reader sensitivity according to the BCSC benchmark [1]. Here, we focused mostly on a second operation point of 99% specificity in a safety net application. In such an application, the AI system analyzes the cases independently of the radiologist and is activated after the radiologist's BI-RADS score assignment. Only when the AI system deems a case malignant does it check the original BI-RADS category; if the case was assigned BI-RADS 1-2, it raises an alarm for a recall. The 99% specificity operation point was selected to reduce the number of false alarms.
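The resulting decision rule is simple enough to state as code; the following is a minimal sketch under assumed names, not the deployed implementation:

```python
def safety_net_alert(ai_score, birads, threshold):
    """Raise a recall alert only when the AI deems the case malignant
    (score at or above the 99%-specificity threshold) AND the radiologist
    assigned a negative BI-RADS category (1-2)."""
    ai_positive = ai_score >= threshold
    radiologist_negative = birads in (1, 2)
    return ai_positive and radiologist_negative
```

Cases the radiologist already recalled (BI-RADS ≥ 3) never trigger an alert, which is what keeps the safety net from interfering with the normal reading workflow.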

Retrospective Held-Out Set
The dataset was collected from five AMC imaging facilities and from the MHS database. The cohort was composed of consecutive women who underwent at least one DM examination between 2013 and 2017 and had at least one year of clinical history. Most women in AMC undergo a screening mammography every two years, with the exception of women with familial history, who are offered an annual exam. Mammograms are typically read by a single fellowship-trained reader and interpreted using the BI-RADS scale. We excluded exams of women with a history of breast cancer; exams post breast operations (e.g., lumpectomy or mammoplasty); and exams of a single breast. Exams were considered positive if there was an indication of a biopsy positive for cancer, or an appearance in the cancer registry, within 12 months. Exams were considered negative if there was an indication of a negative biopsy within 12 months, or if they had a BI-RADS 1-3 index exam and a completely clean follow-up for at least two years (i.e., all follow-up exams in that period were BI-RADS 1-2, without biopsy recommendations or procedures). For each woman, the first exam meeting the inclusion/exclusion criteria was selected as the index examination. All FN cases in the retrospective test set were reviewed post hoc by a breast radiology specialist with more than 20 years of experience. The retrospective test set analyzed in this work was never used to train or tune the AI system.
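The positive/negative definitions above can be summarized in a short sketch; field names and encodings are assumptions for illustration, not the study's actual data schema:

```python
def label_exam(biopsy_12m, in_registry_12m, index_birads,
               followup_birads_2y, biopsy_rec_2y):
    """Illustrative labeling sketch. Returns 'positive', 'negative',
    or None when neither definition is met (case excluded)."""
    # Positive: malignant biopsy or cancer-registry entry within 12 months.
    if biopsy_12m == "malignant" or in_registry_12m:
        return "positive"
    # Negative: benign biopsy within 12 months...
    if biopsy_12m == "benign":
        return "negative"
    # ...or a BI-RADS 1-3 index exam with a completely clean
    # two-year follow-up (all BI-RADS 1-2, no biopsy recommendations).
    clean_followup = (len(followup_birads_2y) > 0
                      and all(b in (1, 2) for b in followup_birads_2y)
                      and not biopsy_rec_2y)
    if index_birads in (1, 2, 3) and clean_followup:
        return "negative"
    return None
```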

Reader Study
In the reader study, five AMC board-certified radiologists with breast mammography fellowships interpreted 120 exams. Two of the readers had 20 years of experience, two had 10 years of experience, and one had more than a year of experience reading breast mammograms. An additional reader with more than 20 years of experience conducted a post-hoc review of the FN cases in the study. Readers used their regular systems and screens. Readers and the AI model were exposed to the index exam images and the entire set of clinical data. They had no access to previous exams or to other modalities acquired on the index examination date. Accordingly, we excluded cases whose high BI-RADS was due only to a high US BI-RADS (i.e., for recalled cases, the retrospective DM itself warranted the recall). Additionally, when an exam had a negative biopsy indicating healthy tissue, we verified that there was no subsequent positive biopsy or cancer-registry record. Here the definition of FN was loosened in comparison to the retrospective test set, to introduce cancers that were diagnosed within a two-year window. We made no distinction as to whether a finding was visible in DM, US, or neither. Cases were assigned in a random order to each reader, and each reader covered the entire set of cases.

Bootstrapping and Statistical Analysis
Ideally, AI system's performance and contribution to a human reader should be assessed in conditions as close to real-world prevalence as possible. However, in most studies, this is not the case. Roughly 98% of mammograms in a screening population are normal. The manner in which AI models are trained often results in datasets enriched in abnormal cases. Here too, the retrospective test set is not reflective of real-world prevalence of breast cancer as reported by AMC or the BCSC benchmark [1]. The data is enriched with biopsy cases (880/2,638, 33%) and especially FN (90/2,638, 3%).
For this purpose, we utilized the entire set of clinical data in MHS (69,149 cases) to estimate real-world prevalence in the population. Performance of the original radiologist was estimated once according to DM alone and once based on both DM and US, when available. Using these proportions, we bootstrapped with replacement a sample set of 1,000 cases over 1,000 iterations and calculated the reader's and the AI's performance on the retrospective dataset. We report the mean and 95% confidence interval (CI) for each measure.
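A minimal sketch of this prevalence-matched bootstrap follows; the names are assumptions (e.g., `weights` as per-case sampling probabilities chosen so that resampled sets reflect real-world prevalence), not the study's code:

```python
import numpy as np

def bootstrap_metric(labels, preds, weights, n_cases=1000, n_iter=1000, seed=0):
    """Resample n_cases exams with replacement (probabilities from
    `weights`), compute sensitivity and specificity per iteration, and
    return (mean, CI-low, CI-high) for each measure."""
    rng = np.random.default_rng(seed)
    labels, preds = np.asarray(labels), np.asarray(preds)
    p = np.asarray(weights, dtype=float)
    p /= p.sum()
    sens, spec = [], []
    for _ in range(n_iter):
        idx = rng.choice(len(labels), size=n_cases, replace=True, p=p)
        y, yhat = labels[idx], preds[idx]
        tp = np.sum((y == 1) & (yhat == 1))
        tn = np.sum((y == 0) & (yhat == 0))
        sens.append(tp / max(np.sum(y == 1), 1))
        spec.append(tn / max(np.sum(y == 0), 1))
    summarize = lambda a: (float(np.mean(a)),
                           float(np.percentile(a, 2.5)),
                           float(np.percentile(a, 97.5)))
    return summarize(sens), summarize(spec)
```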
Fisher's exact test and the Wilcoxon signed-rank test were used to evaluate significant differences in performance. For multiple hypotheses, we used the Benjamini-Hochberg adjustment. Inter-reader agreement was estimated using Cohen's kappa statistic. P-values less than 0.05 were considered statistically significant.
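For the multiple-hypothesis adjustment and the agreement statistic, generic NumPy sketches are shown below (the Fisher and Wilcoxon tests themselves are available as `scipy.stats.fisher_exact` and `scipy.stats.wilcoxon`); these are standard implementations, not the study's code:

```python
import numpy as np

def benjamini_hochberg(pvals):
    # Step-up BH procedure: rank p-values, scale by m/rank, then enforce
    # monotonicity from the largest rank downward.
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * m / (np.arange(m) + 1)
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

def cohens_kappa(a, b):
    # Cohen's kappa: observed agreement corrected for chance agreement.
    a, b = np.asarray(a), np.asarray(b)
    p_observed = np.mean(a == b)
    cats = np.unique(np.concatenate([a, b]))
    p_expected = sum(np.mean(a == c) * np.mean(b == c) for c in cats)
    return (p_observed - p_expected) / (1.0 - p_expected)
```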

Safety Net Application on Held-Out Retrospective Data
A held-out set of 2,638 individual exams was collected (age 55 [47-63], BMI 26 [23-30], median and interquartile range). Each exam in the dataset included the four standard mammography images as well as the detailed clinical history of the woman. The dataset consisted of: 1,688/2,638 (64%) BI-RADS 1-2 cases with a clean follow-up, 70/2,638 (3%) BI-RADS 3 cases with a clean follow-up, 501/2,638 (19%) negative-biopsy cases, and 379/2,638 (14%) positive-biopsy cases. The dataset was intentionally enriched in FN cases, with 24% (90/379) of cancer cases originally missed by the radiologist. A screening US exam was performed in 75% (1,967/2,638) of the cases, and BI-RADS scores were reported separately for DM and US. Performance of the original radiologist was estimated twice: based on DM alone, and based on both modalities when US was available. The AI system's performance was estimated at an operation point of 99% specificity as a safety net application (see Sect. 2.1). We used a bootstrapping analysis, with mean and CI, to estimate performance at real-world prevalence (see Sect. 2.4). Of the 90 cases that were missed by the radiologist, the AI safety net was able to identify 11. We asked an expert breast radiologist to review all FN cases and determine whether the malignant lesion was visible in the index exam's DM or not. The expert reader first analyzed the index exam, and only then utilized any other US or follow-up examination images, as well as pathology reports, to localize the malignant finding. The AI identified a larger proportion of visible cancers than non-visible ones (Fisher exact test, p-value = 1.76 × 10−2; see Table 1).

Reader Study
We then continued to examine the AI safety net application in a separate reader study with five certified breast radiologists from AMC. The study consisted of 120 cases (age 53 [47-65], BMI 26 [23-29], median and interquartile range), including 36/120 (30%) normal cases (original BI-RADS 1-2 with a clean two-year follow-up), 13/120 (11%) original BI-RADS 3 cases without a biopsy and with a clean two-year follow-up, 35/120 (29%) cases with a negative biopsy within one year, and 36/120 (30%) cases with a positive biopsy. The cancer cases further included 10/36 (28%) FN. The original BI-RADS assigned to the FN cases was either 1 or 2.
Readers were asked to assign each case a BI-RADS 1-5 score. Their answers were compared to the ground truth, and individual sensitivity/specificity measures were calculated. We compared the readers' performance to the AI system at two operation points: 1) 87% sensitivity, and 2) 99% specificity, for the safety net application (Fig. 2a).

Fig. 2. Performance of the readers and AI system in the reader study. A) ROC curve of the AI system (AUC = 0.81). Sensitivity/specificity of each study reader (in blue), the original reader (purple), and the AI at 87% sensitivity and 99% specificity (green and red, respectively). B) Differences in interpretation (recall/no-recall) of cancer and non-cancer cases between pairs of readers (blue dots) and between readers and the AI (green squares). (Color figure online)

At an operation point of 87% sensitivity, the AI system exceeded the average radiologist's performance (sensitivity 86.1% vs. 78.3%; specificity 60.7% vs. 41.9%). Interestingly, the readers' average sensitivity matched the retrospective average sensitivity estimated for AMC radiologists on DM (see Sect. 2.4). However, the retrospective specificity was much higher than the one obtained in the reader study (96% vs. 42%). Indeed, in regular settings, the radiologist can compare the index exam to previous exams of the same woman, or to the US, if either exists. Moreover, the reader study was enriched with negative-biopsy cases, which are more contestable. The average level of agreement between radiologists based on the BI-RADS score was only fair (Cohen's kappa of 0.34 [0.28-0.42]), with a slight increase when it was based on recall/no-recall bins instead (0.37 [0.30-0.47]). In general, the radiologists tended to agree more among themselves on cancer cases than on non-cancer cases (Fig. 2b). The average agreement with the AI system was even lower (0.19 [0.02-0.35], Table 2). Similarly, most of the disagreement between the readers and the AI system was rooted in the non-cancer cases (which the AI more often classified correctly) rather than the cancer cases (Fig. 2b).
At an operation point of 99% specificity (sensitivity of 36.1% at specificity of 97.6%), the AI system was still able to detect four missed cancers (two cases for reader #3, one case for reader #2, and one case for reader #5). Importantly, the safety net application in the reader study did not add additional false alarms to any of the readers (Table 2).
In a post-hoc visibility analysis of the 10 FN cases (see Sect. 2.3), an independent expert determined that 6 out of 10 FN were not visible in the index exam. Moreover, there was no association between the FN cases each reader suspected to be malignant and their post-hoc visibility (p-value > 0.05, Fisher exact test). As such, the suspected lesions identified by the readers in those cases were most likely benign, and if biopsied, would have returned negative.

Conclusions
In this work, we evaluated an AI system as a radiologist's safety net application. The safety net contributed significantly to the readers' sensitivity, especially when the analysis was based on DM alone, but also when combined with US. In a reader study, we demonstrated that even on a challenging dataset enriched with biopsy and FN cases, a safety net application could have benefited the readers. When the AI operated at the average sensitivity level of radiologists according to the BCSC, it had low agreement with the readers and, as such, was in a better position to give useful insights, especially when there were no prior images or US available, such as in the case of women undergoing their first exam.

This analysis was not without limitations. The data originated from five different facilities of a single provider in one country, using a single mammography vendor (Hologic). In some cases, US was performed prior to the mammogram, and the radiologist may have been aware of the US report before analyzing the DM. Even so, according to the data, the existence of a US exam did not guarantee that the DM's BI-RADS was equal to or higher than the US's. In the reader study, the readers operated in their regular environment but did not have access to prior images or US, both essential tools in their daily work known to impact their performance. The AI system did not use those either.
To account for the lack of US, cases with original high BI-RADS due only to US were excluded from the study. Readers and AI had access to the same clinical data.
The AI safety net application was designed to interfere as little as possible with the radiologist's routine, analyzing cases independently and raising a minimal number of alarms as a second reader. Even under those restrictions, it demonstrated useful capabilities. This is only one possible application of AI systems, but one we believe to be practical for immediate use.
Funding. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 813533.