Other Open Access
David Zimmerer; Jens Petersen; Gregor Köhler; Paul Jäger; Peter Full; Tobias Roß; Tim Adler; Annika Reinke; Lena Maier-Hein; Klaus Maier-Hein
This is the challenge design document for the "Medical Out-of-Distribution Analysis Challenge", accepted for MICCAI 2020.
Despite overwhelming successes in recent years, progress in the field of biomedical image computing still largely depends on the availability of annotated training examples. This annotation process is often prohibitively expensive because it requires the valuable time of domain experts. Additionally, this approach simply does not scale well: whenever a new imaging modality is introduced, acquisition parameters change, or even something as basic as the target demographic shifts, new annotated cases have to be created to allow methods to cope with the resulting images. Image labeling is thus bound to become a major bottleneck in the coming years. Furthermore, it has been shown that many algorithms used in image analysis are vulnerable to out-of-distribution samples, resulting in wrong and overconfident decisions [20, 21, 22, 23]. In addition, physicians can overlook unexpected conditions in medical images, a phenomenon often termed ‘inattentional blindness’: Drew et al. reported that 50% of trained radiologists did not notice a gorilla rendered into a lung CT scan while assessing it for lung nodules.
One approach that does not require labeled images and can generalize to unseen pathological conditions is out-of-distribution detection, or anomaly detection (the terms are used interchangeably in this context). Anomaly detection can recognize and outline conditions that have not been encountered during training; it thus circumvents the time-consuming labeling process and can quickly be adapted to new modalities. Additionally, by highlighting abnormal regions, anomaly detection can guide the physicians’ attention to otherwise overlooked abnormalities in a scan and potentially reduce the time required to inspect medical images.
However, while there is a lot of recent research on improving anomaly detection [8, 9, 10, 11, 12, 13, 14, 15, 16, 17], especially with a focus on the medical field [4, 5, 6, 7], a common dataset/benchmark for comparing the different approaches is missing, and it is therefore currently hard to compare them fairly. While common benchmarks for natural images have recently been proposed, e.g. for defect detection and for anomaly detection in traffic scenes, medical imaging still lacks such a common benchmark.
We propose the medical out-of-distribution challenge as a standardized dataset and benchmark for anomaly detection, with two different tasks. The first is a sample-wise (i.e. patient-wise) analysis: detecting out-of-distribution samples, for example scans showing a pathological condition or any other condition not seen in the training set. Such samples can pose a problem for classically supervised algorithms, and detecting them could further allow physicians to prioritize patients. The second is a voxel-wise analysis: assigning a score to each voxel, highlighting abnormal regions and potentially guiding the physician.
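To make the two task formats concrete, the following is a minimal sketch of how both scores could be produced from a single reconstruction-based model. The `reconstruct` function here is a hypothetical stand-in (a crude smoothing filter); in an actual submission it would be, e.g., the output of a trained autoencoder. It illustrates only the expected outputs: one score per voxel, and one aggregated score per sample.

```python
import numpy as np

def reconstruct(volume: np.ndarray) -> np.ndarray:
    """Stand-in 'model' (hypothetical): a simple separable 3-voxel mean blur.
    In practice this would be the reconstruction of a trained model."""
    out = volume.astype(float)
    for axis in range(volume.ndim):
        out = (np.roll(out, 1, axis) + out + np.roll(out, -1, axis)) / 3.0
    return out

def voxel_scores(volume: np.ndarray) -> np.ndarray:
    """Task 2 (voxel-wise): one anomaly score per voxel,
    here the absolute reconstruction error."""
    return np.abs(volume.astype(float) - reconstruct(volume))

def sample_score(volume: np.ndarray) -> float:
    """Task 1 (sample-wise): one anomaly score per scan,
    here a simple aggregation (mean) of the voxel scores."""
    return float(voxel_scores(volume).mean())

# toy example: a volume with a single bright out-of-distribution voxel
vol = np.zeros((16, 16, 16))
vol[8, 8, 8] = 1.0
scores = voxel_scores(vol)
assert scores.shape == vol.shape  # one score per voxel
assert scores.argmax() == np.ravel_multi_index((8, 8, 8), vol.shape)
```

The aggregation from voxel scores to a sample score (here a mean) is itself a design choice; maximum or quantile-based aggregations are equally plausible.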
However, there are a few aspects to consider when choosing an anomaly detection dataset. First, as in reality, the types of anomalies should not be known beforehand. This is a particular problem when a dataset is tested on only a single pathological condition, which makes it vulnerable to exploitation: with an educated guess (based on the dataset) and a fully supervised segmentation approach trained on a separate, disallowed dataset, one could outperform legitimately trained anomaly detection approaches.
Furthermore, making the exact types of anomalies known can bias the evaluation. Studies have shown that proposed anomaly detection algorithms tend to overfit on a given task when the properties of the test set and the kinds of anomalies are known beforehand, which further hinders the comparability of different algorithms [6, 18, 19, 23]. Second, combining test sets from different sources with different conditions can also cause problems: by definition, the different sources already introduce a distribution shift relative to the training dataset, complicating a clean and meaningful evaluation.
To address these issues we provide two datasets with more than 600 scans each, one brain MRI dataset and one abdominal CT dataset, allowing a comparison of the generalizability of the approaches. To prevent overfitting on the (types of) anomalies present in our test set, the test set will be kept confidential at all times. The training set comprises hand-selected scans in which no anomalies were identified; the remaining scans are assigned to the test set. Thus some scans in the test set contain no anomalies, while others contain naturally occurring anomalies. In addition to the natural anomalies, we add synthetic anomalies. We chose several structured types of synthetic anomalies (e.g. a tumor or an image of a gorilla rendered into a brain scan) to cover a broad variety of anomalies and to allow an analysis of the strengths and weaknesses of the methods along different factors (type, size, contrast, ...). We believe that this allows a controlled and fair comparison of different algorithms, in line with recently proposed evaluation protocols. We hope that providing a standardized dataset enables a fair comparison of different approaches and can show how well they work in a realistic clinical setting.
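As an illustration of how a structured synthetic anomaly with a known ground-truth mask might be constructed, the sketch below blends a uniform-intensity sphere into a scan. This is a hypothetical example scheme only; the challenge's actual corruptions (and their parameters) are kept confidential, and the function name and blending rule here are assumptions for illustration.

```python
import numpy as np

def add_sphere_anomaly(volume, center, radius, intensity, alpha=0.7):
    """Blend a uniform-intensity sphere into a copy of `volume`.
    Returns the corrupted volume and the ground-truth anomaly mask.
    (Hypothetical scheme; not the challenge's actual corruption.)"""
    # open grids broadcast to the full volume shape without materializing it
    grids = np.ogrid[tuple(slice(0, s) for s in volume.shape)]
    dist2 = sum((g - c) ** 2 for g, c in zip(grids, center))
    mask = dist2 <= radius ** 2
    corrupted = volume.astype(float).copy()
    # alpha-blend a foreign intensity into the masked region
    corrupted[mask] = (1 - alpha) * corrupted[mask] + alpha * intensity
    return corrupted, mask

vol = np.random.default_rng(0).random((32, 32, 32))
bad, mask = add_sphere_anomaly(vol, center=(16, 16, 16), radius=5, intensity=2.0)
assert mask.sum() > 0                       # anomaly region is non-empty
assert np.allclose(bad[~mask], vol[~mask])  # scan untouched outside the mask
```

Varying the blending factor, radius, and source of the inserted content (noise, a patch from another scan, a rendered object) is what enables the per-factor analysis of method strengths and weaknesses described above.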
 Drew, Trafton, Melissa L. H. Vo, and Jeremy M. Wolfe. “The Invisible Gorilla Strikes Again: Sustained Inattentional Blindness in Expert Observers.” Psychological Science 24, no. 9 (September 2013): 1848–53. https://doi.org/10.1177/0956797613479386.
 Bergmann, Paul, Michael Fauser, David Sattlegger, and Carsten Steger. “MVTec AD -- A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection,” 9592–9600, 2019. http://openaccess.thecvf.com/content_CVPR_2019/html/Bergmann_MVTec_AD_--_A_Comprehensive_Real-World_Dataset_for_Unsupervised_Anomaly_CVPR_2019_paper.html.
 Hendrycks, Dan, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. “A Benchmark for Anomaly Segmentation.” ArXiv:1911.11132 [Cs], November 25, 2019. http://arxiv.org/abs/1911.11132.
 Chen, Xiaoran, Nick Pawlowski, Martin Rajchl, Ben Glocker, and Ender Konukoglu. “Deep Generative Models in the Real-World: An Open Challenge from Medical Imaging.” CoRR abs/1806.05452 (2018).
 Baur, Christoph, Benedikt Wiestler, Shadi Albarqouni, and Nassir Navab. “Deep Autoencoding Models for Unsupervised Anomaly Segmentation in Brain MR Images.” ArXiv:1804.04488 [Cs], April 12, 2018. http://arxiv.org/abs/1804.04488.
 Zimmerer, David, Fabian Isensee, Jens Petersen, Simon Kohl, and Klaus Maier-Hein. “Unsupervised Anomaly Localization Using Variational Auto-Encoders.” In International Conference on Medical Image Computing and Computer-Assisted Intervention, 289–297. Springer, 2019.
 Schlegl, Thomas, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. “Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery,” n.d. https://arxiv.org/pdf/1703.05921.pdf.
 Abati, Davide, Angelo Porrello, Simone Calderara, and Rita Cucchiara. “Latent Space Autoregression for Novelty Detection.” ArXiv:1807.01653 [Cs], July 4, 2018. http://arxiv.org/abs/1807.01653.
 Ahmed, Faruk, and Aaron Courville. “Detecting Semantic Anomalies.” ArXiv:1908.04388 [Cs], August 13, 2019. http://arxiv.org/abs/1908.04388.
 Akçay, Samet, Amir Atapour-Abarghouei, and Toby P. Breckon. “Skip-GANomaly: Skip Connected and Adversarially Trained Encoder-Decoder Anomaly Detection.” ArXiv:1901.08954 [Cs], January 25, 2019. http://arxiv.org/abs/1901.08954.
 Beggel, Laura, Michael Pfeiffer, and Bernd Bischl. “Robust Anomaly Detection in Images Using Adversarial Autoencoders.” ArXiv:1901.06355 [Cs, Stat], January 18, 2019. http://arxiv.org/abs/1901.06355.
 Bergmann, Paul, Michael Fauser, David Sattlegger, and Carsten Steger. “Uninformed Students: Student-Teacher Anomaly Detection with Discriminative Latent Embeddings.” ArXiv:1911.02357 [Cs], November 6, 2019. http://arxiv.org/abs/1911.02357.
 Choi, Hyunsun, Eric Jang, and Alexander A. Alemi. “WAIC, but Why? Generative Ensembles for Robust Anomaly Detection.” ArXiv:1810.01392 [Cs, Stat], October 2, 2018. http://arxiv.org/abs/1810.01392.
 Guggilam, Sreelekha, S. M. Arshad Zaidi, Varun Chandola, and Abani Patra. “Bayesian Anomaly Detection Using Extreme Value Theory.” ArXiv:1905.12150 [Cs, Stat], May 28, 2019. http://arxiv.org/abs/1905.12150.
 Maaløe, Lars, Marco Fraccaro, Valentin Liévin, and Ole Winther. “BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling.” ArXiv:1902.02102 [Cs, Stat], February 6, 2019. http://arxiv.org/abs/1902.02102.
 Piciarelli, Claudio, Pankaj Mishra, and Gian Luca Foresti. “Image Anomaly Detection with Capsule Networks and Imbalanced Datasets.” ArXiv:1909.02755 [Cs], September 6, 2019. http://arxiv.org/abs/1909.02755.
 Sabokrou, Mohammad, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. “Adversarially Learned One-Class Classifier for Novelty Detection.” ArXiv:1802.09088 [Cs], February 25, 2018. http://arxiv.org/abs/1802.09088.
 Goldstein, Markus, and Seiichi Uchida. “A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data.” PLOS ONE 11, no. 4 (April 19, 2016): e0152173. https://doi.org/10.1371/journal.pone.0152173.
 Škvára, Vít, Tomáš Pevný, and Václav Šmídl. “Are Generative Deep Models for Novelty Detection Truly Better?” ArXiv:1807.05027 [Cs, Stat], July 13, 2018. http://arxiv.org/abs/1807.05027.
 Hendrycks, Dan, and Kevin Gimpel. “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks.” ArXiv:1610.02136 [Cs], October 7, 2016. http://arxiv.org/abs/1610.02136.
 Mehrtash, Alireza, William M. Wells III, Clare M. Tempany, Purang Abolmaesumi, and Tina Kapur. “Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation.” ArXiv:1911.13273 [Cs, Eess], November 29, 2019. http://arxiv.org/abs/1911.13273.
 Roady, Ryne, Tyler L. Hayes, Ronald Kemker, Ayesha Gonzales, and Christopher Kanan. “Are Out-of-Distribution Detection Methods Effective on Large-Scale Datasets?” ArXiv:1910.14034 [Cs], October 30, 2019. http://arxiv.org/abs/1910.14034.
 Shafaei, Alireza, Mark Schmidt, and James J. Little. “A Less Biased Evaluation of Out-of-Distribution Sample Detectors.” ArXiv:1809.04729 [Cs, Stat], August 20, 2019. http://arxiv.org/abs/1809.04729.