Duke Lung Cancer Screening Dataset 2024

1. Duke University School of Medicine
2. Duke University
3. Duke University Health System

Note: This is Part 1 of the dataset.

Part 1 can be found at: https://zenodo.org/records/13799069
Part 2 can be found at: https://zenodo.org/records/12784601
Part 3 can be found at: https://zenodo.org/records/14659131

Background: Lung cancer risk classification is an increasingly important area of research as low-dose thoracic CT screening programs have become standard of care for patients at high risk for lung cancer. There is limited availability of large, annotated public databases for the training and testing of algorithms for lung nodule classification.

Methods: Screening chest CT scans performed between January 1, 2015 and June 30, 2021 at Duke University Health System were considered for this study. Nodules were annotated semi-automatically: a publicly available deep learning nodule detection algorithm trained on the LUNA16 dataset identified initial candidates, which were then either accepted based on the nodule location given in the radiology text report or manually annotated by a medical student and a fellowship-trained cardiothoracic radiologist.
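The candidate-acceptance step described in the Methods can be illustrated with a minimal sketch. The source does not specify how candidate locations were compared with the report text, so everything here is an assumption for illustration only: the `Candidate` class, its `lobe` field, the `triage` function, and the simple substring match are all hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A detector output. All fields are hypothetical, for illustration."""
    scan_id: str
    lobe: str  # assumed location label, e.g. "right upper lobe"


def triage(candidates, report_text):
    """Split candidates into auto-accepted and needing manual annotation.

    Assumption: a candidate is accepted when its lobe label appears
    verbatim (case-insensitive) in the radiology report text.
    """
    text = report_text.lower()
    accepted, needs_review = [], []
    for cand in candidates:
        if cand.lobe.lower() in text:
            accepted.append(cand)
        else:
            needs_review.append(cand)
    return accepted, needs_review


if __name__ == "__main__":
    cands = [
        Candidate("scan-001", "right upper lobe"),
        Candidate("scan-001", "left lower lobe"),
    ]
    ok, review = triage(cands, "4 mm nodule in the Right Upper Lobe.")
    print(len(ok), "accepted;", len(review), "sent to manual annotation")
```

A real pipeline would likely need more robust report parsing than a substring match; the sketch only shows the accept-or-review split.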

Results: The dataset contains 1613 CT volumes with 2487 annotated nodules, selected from a total dataset of 2061 patients, with the remaining data reserved for future testing. Radiologist spot-checking confirmed the semi-automated annotation had an accuracy rate of >90%.

Conclusions: The Duke Lung Cancer Screening Dataset 2024 is the first large dataset for CT screening for lung cancer reflecting the use of current CT technology. It is a useful resource for lung cancer risk classification research, and the efficient annotation methods described for its creation may be used to generate similar databases for future research.

Dataset Part Details:
Part 1: DLCS subsets 1 to 7, metadata, and annotations.
Part 2: DLCS subsets 8 and 9, and CT image info metadata.
Part 3: DLCS subset 10.
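Because the subsets are split across three Zenodo records, a small lookup can help locate the record that holds a given subset. The sketch below only encodes the part list and URLs stated above; the names `PART_URLS`, `part_for_subset`, and `url_for_subset` are hypothetical helpers, not part of the dataset or its code repository.

```python
# Zenodo record URLs for each dataset part, as listed in this description.
PART_URLS = {
    1: "https://zenodo.org/records/13799069",
    2: "https://zenodo.org/records/12784601",
    3: "https://zenodo.org/records/14659131",
}


def part_for_subset(subset: int) -> int:
    """Return the dataset part (1 to 3) that contains a DLCS subset (1 to 10)."""
    if 1 <= subset <= 7:
        return 1
    if subset in (8, 9):
        return 2
    if subset == 10:
        return 3
    raise ValueError(f"DLCS subset must be between 1 and 10, got {subset}")


def url_for_subset(subset: int) -> str:
    """Return the Zenodo record URL for the part holding the given subset."""
    return PART_URLS[part_for_subset(subset)]


if __name__ == "__main__":
    for s in (1, 8, 10):
        print(f"subset {s}: part {part_for_subset(s)} at {url_for_subset(s)}")
```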

Updates and Versions:

Part 1, Version 1.0 (published 03/05/2024): Released the initial dataset, including data subsets 1 to 7 and 3D bounding box annotations of the lung nodules.
Part 1, Version 1.1 (published 09/19/2024): Added metadata file (DLCSD24_metadata_v1.1.xlsx) and updated the dataset description and title. DOI: 10.5281/zenodo.13799069
Part 2, Version 1.0 (published 02/04/2025): Released DLCS subsets 8 and 9, the CT image info metadata (DLCSD24_CT_ImageInfo_v1.csv), and metadata documentation.
Part 3, Version 1.0 (published 02/04/2025): Released DLCS subset 10.

Code Repository:
To support reproducible open-access research and benchmarking, we have shared several pre-trained models and baseline results in GitHub and GitLab repositories.

GitLab: https://gitlab.oit.duke.edu/cvit-public/ai_lung_health_benchmarking
GitHub: https://github.com/fitushar/AI-in-Lung-Health-Benchmarking-Detection-and-Diagnostic-Models-Across-Multiple-CT-Scan-Datasets

Funding:
This work was supported by the Duke Department of Radiology Charles E. Putman Vision Award, NIH/NIBIB P41-EB028744, and NIH/NCI R01-CA261457.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

References: Preprint: arXiv:2405.04605 (arXiv)

Tushar, Fakrul Islam, et al. "AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets." arXiv preprint arXiv:2405.04605 (2024).
Lafata, Kyle J., et al. "Lung Cancer Screening in Clinical Practice: A Five-year Review of Frequency and Predictors of Lung Cancer in the Screened Population." Journal of the American College of Radiology (2023).

Resource type: Dataset

Publisher: Zenodo

License: Creative Commons Attribution Non Commercial No Derivatives 4.0 International