Published March 2, 2021 | Version v1

Medical Out-of-Distribution Analysis Challenge 2021

  • 1. Div. Medical Image Computing (MIC), German Cancer Research Center (DKFZ)
  • 2. Div. Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ)

Description

Despite overwhelming successes in recent years, progress in the field of biomedical image computing still largely depends on the availability of annotated training examples. This annotation process is often prohibitively expensive because it requires the valuable time of domain experts. Additionally, this approach simply does not scale well: whenever a new imaging modality is introduced, acquisition parameters change, or even something as basic as the target demographic shifts, new annotated cases have to be created to allow methods to cope with the resulting images. Image labeling is thus bound to become the major bottleneck in the coming years. Furthermore, it has been shown that many algorithms used in image analysis are vulnerable to out-of-distribution samples, resulting in wrong and overconfident decisions [20, 21, 22, 23]. In addition, physicians can overlook unexpected conditions in medical images, a phenomenon often termed ‘inattentional blindness’. In [1], Drew et al. noted that 50% of trained radiologists did not notice an image of a gorilla rendered into a lung CT scan when assessing it for lung nodules.

One approach that does not require labeled images and can generalize to unseen pathological conditions is Out-of-Distribution detection, or anomaly detection (the terms are used interchangeably in this context). Anomaly detection can recognize and outline conditions that have not been encountered during training; it thus circumvents the time-consuming labeling process and can quickly be adapted to new modalities. Additionally, by highlighting abnormal regions, anomaly detection can guide physicians’ attention to otherwise overlooked abnormalities in a scan and potentially reduce the time required to inspect medical images.

However, while there is a lot of recent research on improving anomaly detection [8, 9, 10, 11, 12, 13, 14, 15, 16, 17], especially with a focus on the medical field [4, 5, 6, 7], a common dataset/benchmark to compare different approaches is missing. It is therefore currently hard to compare the different proposed approaches fairly. While common datasets for natural images have been proposed in the last few months, such as defect detection [2] or abnormal traffic scene detection [3], we tried to tackle this issue for medical imaging with last year's challenge [25]. In a similar setting to last year, we suggest the medical out-of-distribution challenge as a standardized dataset and benchmark for anomaly detection. We propose two different tasks. The first is a sample-wise (i.e. patient-wise) analysis, i.e. detecting out-of-distribution samples, for example scans showing a pathological condition or any other condition not seen in the training set. Such samples can pose a problem to classically supervised algorithms, and detecting them could further allow physicians to prioritize different patients. The second is a voxel-wise analysis, i.e. giving a score for each voxel, highlighting abnormal regions and potentially guiding the physician.
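The challenge text does not prescribe any particular algorithm or output format, so the following is only a minimal sketch of how the two tasks differ in the kind of output they require: a score map with one value per voxel, and a single score per scan. It assumes a hypothetical reconstruction-based model, here replaced by a simple Gaussian-blur placeholder, and a percentile aggregation chosen purely for illustration.

import numpy as np
from scipy.ndimage import gaussian_filter

def reconstruct(volume: np.ndarray) -> np.ndarray:
    # Placeholder for a model trained only on anomaly-free scans
    # (e.g. an autoencoder); here it simply blurs the input slightly.
    return gaussian_filter(volume, sigma=2.0)

def voxel_scores(volume: np.ndarray) -> np.ndarray:
    # Voxel-wise task: one abnormality score per voxel.
    # Reconstruction error is a common (but not mandated) choice.
    return np.abs(volume - reconstruct(volume))

def sample_score(volume: np.ndarray) -> float:
    # Sample-wise task: one abnormality score per scan,
    # here aggregated from the voxel-wise map.
    return float(np.percentile(voxel_scores(volume), 99))

# Toy example: a 64^3 "scan" with a small high-intensity blob.
rng = np.random.default_rng(0)
scan = rng.normal(0.0, 0.05, size=(64, 64, 64))
scan[30:34, 30:34, 30:34] += 1.0

print(voxel_scores(scan).shape)   # (64, 64, 64) -> per-voxel score map
print(sample_score(scan))         # single score for the whole scan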

However, there are a few aspects to consider when choosing an anomaly detection dataset. First, as in reality, the types of anomalies should not be known beforehand. This becomes a particular problem when a dataset is tested on only a single pathological condition, which is vulnerable to exploitation: with an educated guess (based on the dataset) and a fully supervised segmentation approach trained on a separate, disallowed dataset, one could outperform other, legitimately trained anomaly detection approaches. Furthermore, making the exact types of anomalies known can bias the evaluation. Studies have shown that proposed anomaly detection algorithms tend to overfit to a given task when properties of the test set and the kinds of anomalies are known beforehand, which further hinders the comparability of different algorithms [6, 18, 19, 23]. Second, combining test sets from different sources with alternative conditions can also cause problems: by definition, the different sources already introduce a distribution shift with respect to the training dataset, complicating a clean and meaningful evaluation.

To address these issues we provide two datasets with more than 600 scans each, one brain MRI dataset and one abdominal CT dataset, to allow for a comparison of the generalizability of the approaches. In order to prevent overfitting to the (types of) anomalies existing in our test set, the test set will be kept confidential at all times. The training set consists of hand-selected scans in which no anomalies were identified; the remaining scans are assigned to the test set. Thus some scans in the test set do not contain anomalies, while others contain naturally occurring anomalies. In addition to the natural anomalies, we add synthetic anomalies. We choose different structured types of synthetic anomalies (e.g. a tumor or an image of a gorilla rendered into a brain scan [1]) to cover a broad variety of anomalies and to allow for an analysis of the strengths and weaknesses of the methods along different factors (type, size, contrast, ...). We believe that this allows for a controlled and fair comparison of different algorithms (as recently similarly proposed by [3]).
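The actual anomaly classes and the test-set generation code remain confidential, so the snippet below is not the challenge's method. It is only a hypothetical sketch of how a structured synthetic anomaly with a controllable size and contrast could in principle be blended into a volume; the add_sphere_anomaly helper and its parameters are assumptions made for illustration.

import numpy as np

def add_sphere_anomaly(volume: np.ndarray, center, radius: float,
                       contrast: float) -> np.ndarray:
    # Blend a spherical intensity shift into a copy of `volume`.
    # `radius` controls the anomaly's size, `contrast` its visibility.
    zz, yy, xx = np.indices(volume.shape)
    dist = np.sqrt((zz - center[0]) ** 2 +
                   (yy - center[1]) ** 2 +
                   (xx - center[2]) ** 2)
    mask = dist <= radius
    out = volume.copy()
    out[mask] += contrast
    return out

# Toy example: a subtle and an obvious anomaly in an empty volume.
vol = np.zeros((64, 64, 64))
subtle = add_sphere_anomaly(vol, center=(32, 32, 32), radius=4, contrast=0.1)
obvious = add_sphere_anomaly(vol, center=(16, 16, 16), radius=10, contrast=1.0)
print(subtle.max(), obvious.max())  # 0.1 and 1.0

Varying the size and contrast in this way is what would make it possible to break down an algorithm's performance by anomaly properties, as described above.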

While multiple different approaches were presented in last year’s edition [25], some of which showed clearly superior performance for certain kinds of anomalies, all algorithms still failed in (different) obvious cases, and especially medically relevant classes of anomalies were still lacking in (clinically relevant) performance. In this year's edition we will introduce new synthetic anomalies and focus more on anomalies that are clinically relevant or were handled poorly in last year's challenge. Consequently, we will completely renew the synthetic part of the test set, keeping last year's classes of anomalies (but generating new test samples for them) and introducing new anomaly classes that we found to be relevant based on last year's challenge.

We hope that providing such a standardized dataset allows for a fair comparison of different approaches and can outline how well different approaches work in a realistic clinical setting.

[1] Drew, Trafton, Melissa L. H. Vo, and Jeremy M. Wolfe. “‘The Invisible Gorilla Strikes Again: Sustained Inattentional Blindness in Expert Observers.’” Psychological Science 24, no. 9 (September 2013): 1848–53. https://doi.org/10.1177/0956797613479386.

[2] Bergmann, Paul, Michael Fauser, David Sattlegger, and Carsten Steger. “MVTec AD -- A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection,” 9592–9600, 2019. http://openaccess.thecvf.com/content_CVPR_2019/html/Bergmann_MVTec_AD_--_A_Comprehensive_Real-World_Dataset_for_Unsupervised_Anomaly_CVPR_2019_paper.html.

[3] Hendrycks, Dan, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. “A Benchmark for Anomaly Segmentation.” ArXiv:1911.11132 [Cs], November 25, 2019. http://arxiv.org/abs/1911.11132.

[4] Chen, Xiaoran, Nick Pawlowski, Martin Rajchl, Ben Glocker, and Ender Konukoglu. “Deep Generative Models in the Real-World: An Open Challenge from Medical Imaging.” CoRR abs/1806.05452 (2018).

[5] Baur, Christoph, Benedikt Wiestler, Shadi Albarqouni, and Nassir Navab. “Deep Autoencoding Models for Unsupervised Anomaly Segmentation in Brain MR Images.” ArXiv:1804.04488 [Cs], April 12, 2018. http://arxiv.org/abs/1804.04488.

[6] Zimmerer, David, Fabian Isensee, Jens Petersen, Simon Kohl, and Klaus Maier-Hein. “Unsupervised Anomaly Localization Using Variational Auto-Encoders.” In International Conference on Medical Image Computing and Computer-Assisted Intervention, 289–297. Springer, 2019.

[7] Schlegl, Thomas, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. “Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery,” n.d. https://arxiv.org/pdf/1703.05921.pdf.

[8] Abati, Davide, Angelo Porrello, Simone Calderara, and Rita Cucchiara. “Latent Space Autoregression for Novelty Detection.” ArXiv:1807.01653 [Cs], July 4, 2018. http://arxiv.org/abs/1807.01653.

[9] Ahmed, Faruk, and Aaron Courville. “Detecting Semantic Anomalies.” ArXiv:1908.04388 [Cs], August 13, 2019. http://arxiv.org/abs/1908.04388.

[10] Akçay, Samet, Amir Atapour-Abarghouei, and Toby P. Breckon. “Skip-GANomaly: Skip Connected and Adversarially Trained Encoder-Decoder Anomaly Detection.” ArXiv:1901.08954 [Cs], January 25, 2019. http://arxiv.org/abs/1901.08954.

[11] Beggel, Laura, Michael Pfeiffer, and Bernd Bischl. “Robust Anomaly Detection in Images Using Adversarial Autoencoders.” ArXiv:1901.06355 [Cs, Stat], January 18, 2019. http://arxiv.org/abs/1901.06355.

[12] Bergmann, Paul, Michael Fauser, David Sattlegger, and Carsten Steger. “Uninformed Students: Student-Teacher Anomaly Detection with Discriminative Latent Embeddings.” ArXiv:1911.02357 [Cs], November 6, 2019. http://arxiv.org/abs/1911.02357.

[13] Choi, Hyunsun, Eric Jang, and Alexander A. Alemi. “WAIC, but Why? Generative Ensembles for Robust Anomaly Detection.” ArXiv:1810.01392 [Cs, Stat], October 2, 2018. http://arxiv.org/abs/1810.01392.

[14] Guggilam, Sreelekha, S. M. Arshad Zaidi, Varun Chandola, and Abani Patra. “Bayesian Anomaly Detection Using Extreme Value Theory.” ArXiv:1905.12150 [Cs, Stat], May 28, 2019. http://arxiv.org/abs/1905.12150.

[15] Maaløe, Lars, Marco Fraccaro, Valentin Liévin, and Ole Winther. “BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling.” ArXiv:1902.02102 [Cs, Stat], February 6, 2019. http://arxiv.org/abs/1902.02102.

[16] Piciarelli, Claudio, Pankaj Mishra, and Gian Luca Foresti. “Image Anomaly Detection with Capsule Networks and Imbalanced Datasets.” ArXiv:1909.02755 [Cs], September 6, 2019. http://arxiv.org/abs/1909.02755.

[17] Sabokrou, Mohammad, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. “Adversarially Learned One-Class Classifier for Novelty Detection.” ArXiv:1802.09088 [Cs], February 25, 2018. http://arxiv.org/abs/1802.09088.

[18] Goldstein, Markus, and Seiichi Uchida. “A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data.” PLOS ONE 11, no. 4 (April 19, 2016): e0152173. https://doi.org/10.1371/journal.pone.0152173.

[19] Škvára, Vít, Tomáš Pevný, and Václav Šmídl. “Are Generative Deep Models for Novelty Detection Truly Better?” ArXiv:1807.05027 [Cs, Stat], July 13, 2018. http://arxiv.org/abs/1807.05027.

[20] Hendrycks, Dan, and Kevin Gimpel. “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks.” ArXiv:1610.02136 [Cs], October 7, 2016. http://arxiv.org/abs/1610.02136.

[21] Mehrtash, Alireza, William M. Wells III, Clare M. Tempany, Purang Abolmaesumi, and Tina Kapur. “Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation.” ArXiv:1911.13273 [Cs, Eess], November 29, 2019. http://arxiv.org/abs/1911.13273.

[22] Roady, Ryne, Tyler L. Hayes, Ronald Kemker, Ayesha Gonzales, and Christopher Kanan. “Are Out-of-Distribution Detection Methods Effective on Large-Scale Datasets?” ArXiv:1910.14034 [Cs], October 30, 2019. http://arxiv.org/abs/1910.14034.

[23] Shafaei, Alireza, Mark Schmidt, and James J. Little. “A Less Biased Evaluation of Out-of-Distribution Sample Detectors.” ArXiv:1809.04729 [Cs, Stat], August 20, 2019. http://arxiv.org/abs/1809.04729.

[24] Maier-Hein, L., Eisenmann, M., Reinke, A. et al. “Why rankings of biomedical image analysis competitions should be interpreted with care.” Nat Commun 9, 5217 (2018). https://doi.org/10.1038/s41467-018-07619-7

Files

MedicalOut-of-DistributionAnalysisChallenge2021_02-11-2021_10-28-29.pdf