Published October 9, 2020 | Version 1.0
Dataset Restricted

RAD-ChestCT Dataset

  • 1. Duke Computer Science, Duke Univ School of Medicine
  • 2. Duke ECE
  • 3. Duke Radiology, ECE, B&B
  • 4. Duke Radiology, ECE, BME
  • 5. Duke ECE, B&B
  • 6. Univ of Arizona, Dept of Medical Imaging
  • 7. King Abdullah University of Science & Technology



The RAD-ChestCT dataset is a large medical imaging dataset developed by Duke MD/PhD student Rachel Draelos during her Computer Science PhD supervised by Lawrence Carin. The full dataset includes 35,747 chest CT scans from 19,661 adult patients. This Zenodo repository contains an initial release of 3,630 chest CT scans, approximately 10% of the dataset. This dataset is of significant interest to the machine learning and medical imaging research communities.


The following published paper includes a description of how the RAD-ChestCT dataset was created: Draelos et al., "Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes," Medical Image Analysis 2021. DOI: 10.1016/

Two additional papers leveraging the RAD-ChestCT dataset are available as preprints:

"Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks" (

"Explainable multiple abnormality classification of chest CT volumes with deep learning" (

Details about the files included in this data release

Metadata Files (4)

CT_Scan_Metadata_Complete_35747.csv: includes metadata about the whole dataset, with information extracted from DICOM headers.

Extrema_5747.csv: includes coordinates for lung bounding boxes for the whole dataset. Coordinates were derived computationally using a morphological image processing lung segmentation pipeline.

Indications_35747.csv: includes scan indications for the whole dataset. Indications were extracted from the free-text reports.

Summary_3630.csv: includes a listing of the 3,630 scans that are part of this repository.

Label Files (3)

The label files contain abnormality x location labels for the 3,630 shared CT volumes. Each CT volume is annotated with a matrix of 84 abnormality labels x 52 location labels. Labels were extracted from the free text reports using the Sentence Analysis for Radiology Label Extraction (SARLE) framework. For each CT scan, the label matrix has been flattened and the abnormalities and locations are separated by an asterisk in the CSV column headers (e.g. "mass*liver"). The labels can be used as the ground truth when training computer vision classifiers on the CT volumes. Label files include:

imgtrain_Abnormality_and_Location_Labels.csv (for the training set)

imgvalid_Abnormality_and_Location_Labels.csv (for the validation set)

imgtest_Abnormality_and_Location_Labels.csv (for the test set)
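
The flattened "abnormality*location" headers described above can be turned back into a 2D matrix by splitting each header on the asterisk. The following is a minimal sketch under stated assumptions, not the dataset authors' code: the function name unflatten_labels is hypothetical, and the toy headers other than "mass*liver" are made up for illustration. It also assumes that identifier columns have already been dropped, so every remaining header contains an asterisk.

```python
import numpy as np


def unflatten_labels(headers, values):
    """Rebuild an abnormality x location matrix from flattened columns.

    headers: column names of the form "abnormality*location"
    values:  one row of label values, in the same order as headers
    """
    pairs = [header.split("*", 1) for header in headers]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    abnormalities = list(dict.fromkeys(abn for abn, _ in pairs))
    locations = list(dict.fromkeys(loc for _, loc in pairs))

    matrix = np.zeros((len(abnormalities), len(locations)), dtype=int)
    for (abn, loc), value in zip(pairs, values):
        matrix[abnormalities.index(abn), locations.index(loc)] = value
    return abnormalities, locations, matrix


# Toy example: 2 abnormalities x 2 locations (illustrative names only).
headers = ["mass*liver", "mass*lung", "nodule*liver", "nodule*lung"]
values = [0, 1, 0, 1]

abnormalities, locations, matrix = unflatten_labels(headers, values)
print(abnormalities)
print(locations)
print(matrix)
```

Applied to one row of a real label file, the same function would be expected to yield an 84 x 52 matrix, provided the headers follow the pattern described above.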

CT Volume Files (3,630)

Each CT scan is provided as a compressed 3D numpy array (npz format). The CT scans can be read using the Python package numpy, version 1.14.5 and above.
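
As a rough illustration of reading a compressed npz file with numpy, the sketch below loads the first array found in an archive. The key under which each volume is stored is not stated here, so the code looks it up rather than assuming a name. The helper load_ct_volume and the synthetic demo file are hypothetical and are not part of the dataset.

```python
import os
import tempfile

import numpy as np


def load_ct_volume(npz_path):
    """Return the first array stored in a compressed .npz archive."""
    with np.load(npz_path) as archive:
        # Assumption: one array per file. Its key name is not documented
        # here, so it is looked up instead of hard-coded.
        key = archive.files[0]
        return archive[key]


# Demo on a synthetic stand-in volume (not real RAD-ChestCT data).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_volume.npz")
fake_volume = np.zeros((4, 8, 8), dtype=np.int16)
np.savez_compressed(demo_path, ct=fake_volume)

volume = load_ct_volume(demo_path)
print(volume.shape, volume.dtype)
```

For a real scan, replace demo_path with the path to one of the downloaded npz files. Printing archive.files first is a quick way to confirm which keys a file actually contains.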

Related Code

Code related to RAD-ChestCT is publicly available on GitHub at

Repositories of interest include:

A repository containing PyTorch code to load the RAD-ChestCT dataset and train convolutional neural network models for multiple abnormality prediction from whole CT volumes.

A repository containing an end-to-end Python framework to convert CT scans from DICOM to numpy format. This code was used to prepare the RAD-ChestCT volumes.

A repository containing the Python implementation of the SARLE label extraction framework used to generate the abnormality and location label matrix from the free-text reports. SARLE has minimal dependencies, and the abnormality and location vocabulary terms can be easily modified to adapt SARLE to different radiologic modalities, abnormalities, and anatomical locations.



The record is publicly accessible, but files are restricted to users with access.

Request access

You need to satisfy these conditions in order for this request to be accepted:

This dataset, and any copyrights therein, are owned by Duke University. In order to receive the dataset, you must choose one of the following two licenses:

1. An open license under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license:

2. A custom license with Duke University, for use without the CC-BY-NC-ND-4.0 restrictions, which can include commercial uses.


Outside contributions to the Duke-owned dataset cannot be accepted unless the contributor assigns copyrights to any modifications, changes, and/or derivatives over to Duke University.


To enter a license agreement with the CC BY-NC-ND 4.0 restrictions, please send a request by email. Mention that you are looking to access this dataset on Zenodo, and provide your academic affiliation and a brief description of why you would like to use this data.

To enter a license agreement without the CC-BY-NC-ND-4.0 restrictions, please contact the Digital Innovations department at the Duke Office for Translation & Commercialization (OTC), with reference to “Zenodo DOI 10.5281/zenodo.6406114” in your email.


Please include in your email your affiliation (if applicable) and a brief description of your research topics and why you would like to use this dataset. Duke University will make use of this information to evaluate approval of your request.

Please note that this dataset is distributed AS IS, WITHOUT ANY WARRANTY; and without the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

