Duke Lung Cancer Screening Dataset 2024
Description
Note - This is part 1 of the dataset.
Part 1 can be found at: https://zenodo.org/records/13799069
Part 2 can be found at: https://zenodo.org/records/12784601
Part 3 can be found at: https://zenodo.org/records/14659131
Background: Lung cancer risk classification is an increasingly important area of research as low-dose thoracic CT screening programs have become standard of care for patients at high risk for lung cancer. There is limited availability of large, annotated public databases for the training and testing of algorithms for lung nodule classification.
Methods: Screening chest CT scans performed between January 1, 2015 and June 30, 2021 at Duke University Health System were considered for this study. Nodules were annotated semi-automatically: a publicly available deep learning nodule detection algorithm trained on the LUNA16 dataset identified initial candidates, which were then either accepted based on the nodule location in the radiology text report or manually annotated by a medical student and a fellowship-trained cardiothoracic radiologist.
Results: The dataset contains 1613 CT volumes with 2487 annotated nodules, selected from a total dataset of 2061 patients, with the remaining data reserved for future testing. Radiologist spot-checking confirmed the semi-automated annotation had an accuracy rate of >90%.
Conclusions: The Duke Lung Cancer Screening Dataset 2024 is the first large dataset for CT screening for lung cancer reflecting the use of current CT technology. It is a useful resource for lung cancer risk classification research, and the efficient annotation methods described for its creation may be used to generate similar databases for future research.
Dataset Part Details:
Part 1: DLCS subsets 1 to 7, metadata, and annotations.
Part 2: DLCS subsets 8 and 9, and CT image info metadata.
Part 3: DLCS subset 10.
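Because the subsets are split across three Zenodo records, a small lookup can help when scripting downloads. The following is a minimal sketch that only encodes the part-to-subset mapping and record URLs listed in this description; the names `PART_URLS`, `PART_SUBSETS`, `part_for_subset`, and `record_url_for_subset` are hypothetical helpers and are not part of the dataset or its code repository.

```python
# Zenodo record URLs, copied from this description.
PART_URLS = {
    1: "https://zenodo.org/records/13799069",
    2: "https://zenodo.org/records/12784601",
    3: "https://zenodo.org/records/14659131",
}

# DLCS subset numbers contained in each part, per "Dataset Part Details".
PART_SUBSETS = {
    1: range(1, 8),    # subsets 1 to 7
    2: range(8, 10),   # subsets 8 and 9
    3: range(10, 11),  # subset 10
}


def part_for_subset(subset: int) -> int:
    """Return the dataset part (1, 2, or 3) that holds the given DLCS subset."""
    for part, subsets in PART_SUBSETS.items():
        if subset in subsets:
            return part
    raise ValueError(f"Unknown DLCS subset: {subset} (expected 1 to 10)")


def record_url_for_subset(subset: int) -> str:
    """Return the Zenodo record URL for the part that holds the given subset."""
    return PART_URLS[part_for_subset(subset)]
```

For example, `record_url_for_subset(9)` resolves to the Part 2 record. The sketch does not download anything; file names inside each record are not assumed here.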
Updates and Versions:
- Part 1, Version 1.0 (published on 03/05/2024): Released the initial partial dataset, comprising subsets 1 to 7 and 3D bounding box annotations of the lung nodules.
- Part 1, Version 1.1 (published on 09/19/2024): Added the metadata file (DLCSD24_metadata_v1.1.xlsx) and updated the dataset description and title. DOI: 10.5281/zenodo.13799069
- Part 2, Version 1.0 (published on 02/04/2025): Released DLCS subsets 8 and 9, the CT image info metadata (DLCSD24_CT_ImageInfo_v1.csv), and metadata documentation.
- Part 3, Version 1.0 (published on 02/04/2025): Released DLCS subset 10.
Code Repository:
To support reproducible open-access research and benchmarking, we have shared several pre-trained models and baseline results in GitLab and GitHub repositories.
GitLab: https://gitlab.oit.duke.edu/cvit-public/ai_lung_health_benchmarking
GitHub: https://github.com/fitushar/AI-in-Lung-Health-Benchmarking-Detection-and-Diagnostic-Models-Across-Multiple-CT-Scan-Datasets
Funding:
This work was supported by the Duke Department of Radiology Charles E. Putman Vision Award, NIH/NIBIB P41-EB028744, and NIH/NCI R01-CA261457.
Related works
- Preprint: arXiv:2405.04605
References
- Tushar, Fakrul Islam, et al. "AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets." arXiv preprint arXiv:2405.04605 (2024).
- Lafata, Kyle J., et al. "Lung Cancer Screening in Clinical Practice: A Five-year Review of Frequency and Predictors of Lung Cancer in the Screened Population." Journal of the American College of Radiology (2023).