Dataset Restricted Access
Draelos, Rachel Lea;
Dov, David;
Mazurowski, Maciej A;
Lo, Joseph Y.;
Henao, Ricardo;
Rubin, Geoffrey D.;
Carin, Lawrence
Overview
The RAD-ChestCT dataset is a large medical imaging dataset developed by Duke MD/PhD student Rachel Draelos during her Computer Science PhD supervised by Lawrence Carin. The full dataset includes 35,747 chest CT scans from 19,661 adult patients. This Zenodo repository contains an initial release of 3,630 chest CT scans, approximately 10% of the dataset. This dataset is of significant interest to the machine learning and medical imaging research communities.
Papers
The following published paper includes a description of how the RAD-ChestCT dataset was created: Draelos et al., "Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes," Medical Image Analysis 2021. DOI: 10.1016/j.media.2020.101857 https://pubmed.ncbi.nlm.nih.gov/33129142/
Two additional papers leveraging the RAD-ChestCT dataset are available as preprints:
"Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks" (https://arxiv.org/abs/2011.08891)
"Explainable multiple abnormality classification of chest CT volumes with deep learning" (https://arxiv.org/abs/2111.12215)
Details about the files included in this data release
Metadata Files (4)
CT_Scan_Metadata_Complete_35747.csv: includes metadata about the whole dataset, with information extracted from DICOM headers.
Extrema_5747.csv: includes coordinates for lung bounding boxes for the whole dataset. Coordinates were derived computationally using a morphological image processing lung segmentation pipeline.
Indications_35747.csv: includes scan indications for the whole dataset. Indications were extracted from the free-text reports.
Summary_3630.csv: includes a listing of the 3,630 scans that are part of this repository.
Label Files (3)
The label files contain abnormality x location labels for the 3,630 shared CT volumes. Each CT volume is annotated with a matrix of 84 abnormality labels x 52 location labels. Labels were extracted from the free text reports using the Sentence Analysis for Radiology Label Extraction (SARLE) framework. For each CT scan, the label matrix has been flattened and the abnormalities and locations are separated by an asterisk in the CSV column headers (e.g. "mass*liver"). The labels can be used as the ground truth when training computer vision classifiers on the CT volumes.
Label files include:
imgtrain_Abnormality_and_Location_Labels.csv (for the training set)
imgvalid_Abnormality_and_Location_Labels.csv (for the validation set)
imgtest_Abnormality_and_Location_Labels.csv (for the test set)
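As a rough illustration of the flattened header convention described above, the following sketch splits headers such as "mass*liver" back into abnormality and location and regroups one row into a nested mapping. It runs on a tiny synthetic CSV rather than the real label files. The identifier column name scan_id, the 0/1 value encoding, and the helper names are assumptions made for this example, not details confirmed by the dataset description.

```python
import csv
import io


def split_label_header(header):
    """Split a flattened header such as 'mass*liver' into (abnormality, location)."""
    abnormality, location = header.split("*", 1)
    return abnormality, location


def row_to_label_matrix(row, id_column):
    """Regroup one CSV row (a dict) into {abnormality: {location: label}}."""
    matrix = {}
    for header, value in row.items():
        # Skip the identifier column and anything that is not an abnormality*location pair.
        if header == id_column or "*" not in header:
            continue
        abnormality, location = split_label_header(header)
        matrix.setdefault(abnormality, {})[location] = int(float(value))
    return matrix


# Synthetic stand-in for a label file; 'scan_id' is a hypothetical column name.
demo_csv = io.StringIO(
    "scan_id,mass*liver,mass*lung,nodule*liver,nodule*lung\n"
    "example_scan,1,0,0,1\n"
)
first_row = next(csv.DictReader(demo_csv))
labels = row_to_label_matrix(first_row, id_column="scan_id")
print(labels["mass"]["liver"])
```

With the real files, the same regrouping would be expected to yield 84 abnormalities with 52 locations each per scan, but the actual identifier column should be checked in the CSV header first.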
CT Volume Files (3,630)
Each CT scan is provided as a compressed 3D numpy array (npz format). The CT scans can be read using the Python package numpy, version 1.14.5 and above.
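A minimal sketch of reading one such npz file with numpy is shown below. The name of the array stored inside each archive is not stated here, so the helper lists the archive's keys and takes the first one; that single-array assumption, the helper name load_ct_volume, and the synthetic demo file are all illustrative only.

```python
import os
import tempfile

import numpy as np


def load_ct_volume(path):
    """Return the first array stored in a compressed .npz archive."""
    with np.load(path) as archive:
        # Assumption: each archive holds exactly one array (the CT volume).
        key = archive.files[0]
        return archive[key]


# Synthetic stand-in for a CT volume file, written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_volume.npz")
np.savez_compressed(demo_path, ct=np.zeros((4, 8, 8), dtype=np.int16))

volume = load_ct_volume(demo_path)
print(volume.shape, volume.dtype)
```

Printing archive.files on a real volume before relying on the first key is a cheap way to confirm what the archive actually contains.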
Related Code
Code related to RAD-ChestCT is publicly available on GitHub at https://github.com/rachellea.
Repositories of interest include:
https://github.com/rachellea/ct-net-models contains PyTorch code to load the RAD-ChestCT dataset and train convolutional neural network models for multiple abnormality prediction from whole CT volumes.
https://github.com/rachellea/ct-volume-preprocessing contains an end-to-end Python framework to convert CT scans from DICOM to numpy format. This code was used to prepare the RAD-ChestCT volumes.
https://github.com/rachellea/sarle-labeler contains the Python implementation of the SARLE label extraction framework used to generate the abnormality and location label matrix from the free text reports. SARLE has minimal dependencies and the abnormality and location vocabulary terms can be easily modified to adapt SARLE to different radiologic modalities, abnormalities, and anatomical locations.
You may request access to the files in this upload, provided that you fulfil the conditions below. The decision whether to grant/deny access is solely under the responsibility of the record owner.
This dataset, and any copyrights therein, are owned by Duke University. In order to receive the dataset, you must choose one of the following two licenses:
1. An open license under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license: https://creativecommons.org/licenses/by-nc-nd/4.0/
2. A custom license with Duke University, for use without the CC-BY-NC-ND-4.0 restrictions, which can include commercial uses.
Outside contributions to the Duke-owned dataset cannot be accepted unless the contributor assigns copyrights to any modifications, changes, and/or derivatives over to Duke University.
To enter a license agreement with the CC BY-NC-ND 4.0 restrictions, please email DOCR.Help@dm.Duke.edu. Mention that you are looking to access this dataset on Zenodo and provide your academic affiliation and a brief description of why you would like to use this data.
To enter a license agreement without the CC-BY-NC-ND-4.0 restrictions, please contact the Digital Innovations department at Duke Office for Translation & Commercialization (OTC) (https://otc.duke.edu/software/) at otcquestions@duke.edu with reference to “Zenodo DOI 10.5281/zenodo.6406114” in your email.
Please include in your email your affiliation (if applicable) and a brief description of your research topics and why you would like to use this dataset. Duke University will make use of this information to evaluate approval of your request.
Please note that this dataset is distributed AS IS, WITHOUT ANY WARRANTY; and without the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.