Published May 26, 2023 | Version v4
Dataset Open

AI-derived annotations for the NLST and NSCLC-Radiomics computed tomography imaging collections

  • 1. Brigham and Women's Hospital
  • 2. Maastricht University
  • 3. PixelMed Publishing

Description

Public imaging datasets are critical for the development and evaluation of automated tools in cancer imaging. Unfortunately, many of the available datasets do not provide annotations of tumors or organs-at-risk, which are crucial for the assessment of these tools, because annotating medical images is time-consuming and requires domain expertise. It has been demonstrated that artificial intelligence (AI) based annotation tools can achieve acceptable performance and thus can be used to automate the annotation of large datasets. As part of the effort to enrich the public data available within NCI Imaging Data Commons (IDC) (https://imaging.datacommons.cancer.gov/) [1], we introduce this dataset, which consists of such AI-generated annotations for two publicly available medical imaging collections of Computed Tomography (CT) images of the chest. For detailed information concerning this dataset, please refer to our publication [2].

We use publicly available pre-trained AI tools to enhance CT lung cancer collections that are unlabeled or partially labeled. The first tool is the nnU-Net deep learning framework [3] for volumetric segmentation of organs, where we use a pretrained model (Task D18 using the SegTHOR dataset) for labeling volumetric regions in the image corresponding to the heart, trachea, aorta and esophagus. These are the major organs-at-risk for radiation therapy for lung cancer. We further enhance these annotations by computing 3D shape radiomics features using the pyradiomics package [4]. The second tool is a pretrained model for per-slice automatic labeling of anatomic landmarks and imaged body part regions in axial CT volumes [5].

We focus on enhancing two publicly available collections, the Non-small Cell Lung Cancer Radiomics (NSCLC-Radiomics collection) [6,7], and the National Lung Screening Trial (NLST collection) [8,9]. The CT data for these collections are available both in The Cancer Imaging Archive (TCIA) [10] and in NCI Imaging Data Commons (IDC). Further, the NSCLC-Radiomics collection includes expert-generated manual annotations of several chest organs, allowing us to quantify the performance of the AI tools in that subset of data.

IDC relies on the DICOM standard to achieve FAIR [11] sharing of data and interoperability. Generated annotations are saved as DICOM Segmentation objects (volumetric segmentations of regions of interest) created using dcmqi [12], and DICOM Structured Report (SR) objects (per-slice annotations of the body part imaged, anatomical landmarks and radiomics features) created using dcmqi and highdicom [13]. 3D shape radiomics features and corresponding DICOM SR objects are also provided for the manual segmentations available in the NSCLC-Radiomics collection.
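
As an illustration of how a DICOM Segmentation object might be inspected, the following minimal sketch lists the segments it declares. It is not taken from the dataset's own notebooks: the helper names are hypothetical, the labels in the stand-in object are made up, and reading a real file assumes the third-party pydicom package is installed.

```python
from types import SimpleNamespace


def segment_labels(dataset):
    """Return {SegmentNumber: SegmentLabel} for a DICOM Segmentation dataset."""
    if getattr(dataset, "Modality", None) != "SEG":
        raise ValueError("not a DICOM Segmentation object")
    return {int(s.SegmentNumber): str(s.SegmentLabel) for s in dataset.SegmentSequence}


def load_segment_labels(path):
    """Read a SEG file from disk (hypothetical helper; requires pydicom)."""
    import pydicom  # imported lazily so segment_labels() has no dependencies

    # Pixel data is not needed to list the segments, so skip it.
    return segment_labels(pydicom.dcmread(path, stop_before_pixels=True))


# Stand-in object with made-up labels, used here instead of a real file.
fake_seg = SimpleNamespace(
    Modality="SEG",
    SegmentSequence=[
        SimpleNamespace(SegmentNumber=1, SegmentLabel="Heart"),
        SimpleNamespace(SegmentNumber=2, SegmentLabel="Aorta"),
    ],
)
print(segment_labels(fake_seg))
```

The actual segment labels and numbering stored in the files may differ from this stand-in; the attributes read (Modality, SegmentSequence, SegmentNumber, SegmentLabel) are standard DICOM Segmentation attributes.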

The dataset is available in IDC, and is accompanied by our publication [2]. The publication details how the data were generated, and how the resulting DICOM objects can be interpreted and used in tools. Additionally, for further information about how to interact with and explore the dataset, please refer to our repository and accompanying Google Colaboratory notebook.

The annotations are organized as follows. For NSCLC-Radiomics, three nnU-Net models were evaluated ('2d-tta', '3d_lowres-tta' and '3d_fullres-tta'), each with its own folder. Within each folder, the PatientID and the StudyInstanceUID are subdirectories, and within these the DICOM Segmentation object and the DICOM SR for the 3D shape features are stored. Separate directories for the DICOM SR body part regression regions ('sr_regions') and landmarks ('sr_landmarks') are also provided with the same folder structure as above. Lastly, the DICOM SR objects for the existing manual annotations are provided in the 'sr_gt' directory. For NSCLC-Radiomics, each patient has a single StudyInstanceUID. The DICOM Segmentation and SR objects are named according to the SeriesInstanceUID of the original CT files.

  • nsclc
    • 2d-tta
      • PatientID
        • StudyInstanceUID
          • ReferencedSeriesInstanceUID_SEG.dcm
          • ReferencedSeriesInstanceUID_features_SR.dcm
    • 3d_lowres-tta
      • PatientID
        • StudyInstanceUID
          • ReferencedSeriesInstanceUID_SEG.dcm
          • ReferencedSeriesInstanceUID_features_SR.dcm
    • 3d_fullres-tta 
      • PatientID
        • StudyInstanceUID
          • ReferencedSeriesInstanceUID_SEG.dcm
          • ReferencedSeriesInstanceUID_features_SR.dcm
    • sr_regions
      • PatientID
        • StudyInstanceUID
          • ReferencedSeriesInstanceUID_regions_SR.dcm
    • sr_landmarks
      • PatientID
        • StudyInstanceUID
          • ReferencedSeriesInstanceUID_landmarks_SR.dcm
    • sr_gt
      • PatientID
        • StudyInstanceUID
          • ReferencedSeriesInstanceUID_features_SR.dcm
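
The layout above can be turned into file paths programmatically. The following is a minimal sketch rather than part of the dataset's tooling: the function name and the `root` argument are hypothetical, and it assumes the archive has been extracted so that the `nsclc` folder sits directly under `root`.

```python
from pathlib import Path

# Model folders listed for NSCLC-Radiomics in the description above.
NSCLC_MODELS = ("2d-tta", "3d_lowres-tta", "3d_fullres-tta")


def nsclc_annotation_paths(root, patient_id, study_uid, series_uid):
    """Return the expected annotation paths for one CT series (hypothetical helper)."""
    base = Path(root) / "nsclc"
    leaf = Path(patient_id) / study_uid
    paths = {}
    for model in NSCLC_MODELS:
        paths[f"{model}/seg"] = base / model / leaf / f"{series_uid}_SEG.dcm"
        paths[f"{model}/features"] = base / model / leaf / f"{series_uid}_features_SR.dcm"
    paths["regions"] = base / "sr_regions" / leaf / f"{series_uid}_regions_SR.dcm"
    paths["landmarks"] = base / "sr_landmarks" / leaf / f"{series_uid}_landmarks_SR.dcm"
    paths["gt_features"] = base / "sr_gt" / leaf / f"{series_uid}_features_SR.dcm"
    return paths
```

Calling `nsclc_annotation_paths("data", patient_id, study_uid, series_uid)` with identifiers taken from the series list yields nine candidate paths; checking `path.exists()` on each is advisable, since this sketch does not verify that every series has every object.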

For NLST, the '3d_fullres-tta' model was evaluated. The data are organized the same as above, where within each folder the PatientID and the StudyInstanceUID are subdirectories. For the NLST collection, it is possible that some patients have more than one StudyInstanceUID subdirectory. Separate directories for the DICOM SR body part regression regions ('sr_regions') and landmarks ('sr_landmarks') are also provided. The DICOM Segmentation and SR objects are named according to the SeriesInstanceUID of the original CT files.

  • nlst
    • 3d_fullres-tta 
      • PatientID
        • StudyInstanceUID
          • ReferencedSeriesInstanceUID_SEG.dcm
          • ReferencedSeriesInstanceUID_features_SR.dcm
    • sr_regions
      • PatientID
        • StudyInstanceUID
          • ReferencedSeriesInstanceUID_regions_SR.dcm
    • sr_landmarks
      • PatientID
        • StudyInstanceUID
          • ReferencedSeriesInstanceUID_landmarks_SR.dcm 
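
Because the identifiers are encoded in the folder and file names, an extracted collection can be indexed without opening any DICOM file. The sketch below is one possible way to do this using only the standard library; the function name and record layout are hypothetical, and it assumes the four-level structure shown above (top folder, PatientID, StudyInstanceUID, file).

```python
from pathlib import Path

# Filename suffixes used in the layout above, mapped to short kind names.
SUFFIXES = {
    "_features_SR.dcm": "features_sr",
    "_regions_SR.dcm": "regions_sr",
    "_landmarks_SR.dcm": "landmarks_sr",
    "_SEG.dcm": "seg",
}


def index_annotations(collection_dir):
    """List (top_folder, PatientID, StudyInstanceUID, SeriesInstanceUID, kind) records."""
    collection_dir = Path(collection_dir)
    records = []
    for path in sorted(collection_dir.glob("*/*/*/*.dcm")):
        folder, patient_id, study_uid = path.relative_to(collection_dir).parts[:3]
        for suffix, kind in SUFFIXES.items():
            if path.name.endswith(suffix):
                # Whatever precedes the suffix is the referenced SeriesInstanceUID.
                series_uid = path.name[: -len(suffix)]
                records.append((folder, patient_id, study_uid, series_uid, kind))
                break
    return records
```

Grouping the resulting records by PatientID and counting distinct StudyInstanceUID values is one way to find the NLST patients that have more than one study subdirectory.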

The query used for NSCLC-Radiomics is here, and a list of corresponding SeriesInstanceUIDs (along with PatientIDs and StudyInstanceUIDs) is here. The query used for NLST is here, and a list of corresponding SeriesInstanceUIDs (along with PatientIDs and StudyInstanceUIDs) is here. The two csv files that describe the series analyzed, nsclc_series_analyzed.csv and nlst_series_analyzed.csv, are also available as uploads to this repository. 
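
The series lists can be explored with the standard csv module. The sketch below groups series by patient and study; the column headers `PatientID`, `StudyInstanceUID` and `SeriesInstanceUID` are an assumption about the csv files and should be checked against the actual header row, and the in-memory sample uses made-up identifiers.

```python
import csv
import io
from collections import defaultdict


def studies_per_patient(csv_file):
    """Group analyzed series by PatientID and then by StudyInstanceUID."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(csv_file):
        # Column names are assumed; adjust them to the real header if it differs.
        grouped[row["PatientID"]][row["StudyInstanceUID"]].append(row["SeriesInstanceUID"])
    return grouped


# Tiny in-memory stand-in for nlst_series_analyzed.csv (made-up identifiers).
sample = io.StringIO(
    "PatientID,StudyInstanceUID,SeriesInstanceUID\n"
    "P1,1.1,1.1.1\n"
    "P1,1.2,1.2.1\n"
    "P2,2.1,2.1.1\n"
)
grouped = studies_per_patient(sample)
multi_study = [pid for pid, studies in grouped.items() if len(studies) > 1]
print(multi_study)
```

For a real file, the same function can be called as `studies_per_patient(open("nlst_series_analyzed.csv", newline=""))`.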

Version updates: 

Version 2: For the regions SR and landmarks SR, changed to use a distinct TrackingUniqueIdentifier for each MeasurementGroup. Also instead of using TargetRegion, changed to use FindingSite. Additionally for the landmarks SR, the TopographicalModifier was made a child of FindingSite instead of a sibling.

Version 3: Added the two csv files that describe which series were analyzed.

Version 4: Modified the landmarks SR as the TopographicalModifier for the Kidney landmark (bottom) does not describe the landmark correctly. The Kidney landmark is the "first slice where both kidneys can be seen well." Instead, removed the use of the TopographicalModifier for that landmark. For the features SR, modified the units code for the Flatness and Elongation, as we incorrectly used mm units instead of no units. 
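
To illustrate why Flatness and Elongation carry no units: to our understanding of the pyradiomics shape feature definitions, both are square roots of ratios of principal-component eigenvalues of the segmented shape, so any length unit cancels. The sketch below uses made-up eigenvalues and hypothetical function names, and does not call pyradiomics itself.

```python
import math


def elongation(lambda_major, lambda_minor):
    """Square root of the ratio of the minor to the major eigenvalue."""
    return math.sqrt(lambda_minor / lambda_major)


def flatness(lambda_major, lambda_least):
    """Square root of the ratio of the least to the major eigenvalue."""
    return math.sqrt(lambda_least / lambda_major)


# Made-up eigenvalues in mm^2; converting mm to cm scales each one by 0.01.
major, minor, least = 400.0, 100.0, 25.0
in_mm = (elongation(major, minor), flatness(major, least))
in_cm = (elongation(major * 0.01, minor * 0.01), flatness(major * 0.01, least * 0.01))
print(in_mm, in_cm)
```

Both tuples agree up to floating-point rounding, which is consistent with encoding these two features without a length unit.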

Notes

This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l. This project has also been funded in whole or in part with Federal funds from the National Institute of Biomedical Imaging and Bioengineering, National Institutes of Health, Grant number T32EB025823 for Training in Image-Guidance, Precision Diagnosis and Therapy.

The following are GitHub links to the packages used:

  • nnU-Net: https://github.com/MIC-DKFZ/nnUNet
  • Body part regression: https://github.com/mic-dkfz/bodypartregression
  • Pyradiomics: https://github.com/AIM-Harvard/pyradiomics
  • Highdicom: https://github.com/herrmannlab/highdicom
  • DCMQI: https://github.com/QIICR/dcmqi
  • GitHub repo with our Colaboratory notebooks: https://github.com/ImagingDataCommons/ai_medima_misc/tree/main/nnunet/notebooks

Files

Files (170.9 MB)

  • md5:df02ecf748f82eaa87f0f23c000b20fb (101.4 MB)
  • md5:aeef079629a9676055683532ab17b95b (135.1 kB)
  • md5:9175bf9160abd2722e2c84a0e78eb3d0 (69.2 MB)
  • md5:f4ff1486257f82e29b54fe30ab171109 (56.9 kB)

Additional details

Related works

Is derived from
Dataset: 10.7937/TCIA.HMQ8-J677 (DOI)
Dataset: 10.7937/K9/TCIA.2015.PF0M9REI (DOI)
References
Journal article: 10.1158/0008-5472.CAN-21-0950 (DOI)
Journal article: 10.1007/s10278-013-9622-7 (DOI)

References

  • [1] Fedorov A, Longabaugh WJ, Pot D, Clunie DA, Pieper S, Aerts HJ, Homeyer A, Lewis R, Akbarzadeh A, Bontempi D, Clifford W. NCI imaging data commons. Cancer research. 2021 Aug 8;81(16):4188.
  • [2] Krishnaswamy D, Bontempi D, Thiriveedhi VK, Punzo D, Clunie D, Bridge CP, Aerts HJ, Kikinis R, Fedorov A. Enrichment of lung cancer computed tomography collections with AI-derived annotations. Scientific Data. 2024 Jan 4;11(1):25.
  • [3] Isensee F, Jaeger PF, Kohl SA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods. 2021 Feb;18(2):203-11.
  • [4] Van Griethuysen JJ, Fedorov A, Parmar C, Hosny A, Aucoin N, Narayan V, Beets-Tan RG, Fillion-Robin JC, Pieper S, Aerts HJ. Computational radiomics system to decode the radiographic phenotype. Cancer research. 2017 Nov 1;77(21):e104-7.
  • [5] Schuhegger S. Body Part Regression for CT Images. arXiv preprint arXiv:2110.09148. 2021 Oct 18.
  • [6] Aerts, H. J. W. L., Wee, L., Rios Velazquez, E., Leijenaar, R. T. H., Parmar, C., Grossmann, P., Carvalho, S., Bussink, J., Monshouwer, R., Haibe-Kains, B., Rietveld, D., Hoebers, F., Rietbergen, M. M., Leemans, C. R., Dekker, A., Quackenbush, J., Gillies, R. J., Lambin, P. (2019). Data From NSCLC-Radiomics [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI
  • [7] Aerts HJ, Velazquez ER, Leijenaar RT, Parmar C, Grossmann P, Carvalho S, Bussink J, Monshouwer R, Haibe-Kains B, Rietveld D, Hoebers F. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature communications. 2014 Jun 3;5(1):1-9.
  • [8] National Lung Screening Trial Research Team. (2013). Data from the National Lung Screening Trial (NLST) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.HMQ8-J677
  • [9] National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. New England Journal of Medicine. 2011 Aug 4;365(5):395-409.
  • [10] Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. Journal of digital imaging. 2013 Dec;26(6):1045-57.
  • [11] Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data. 2016 Mar 15;3(1):1-9.
  • [12] Herz C, Fillion-Robin JC, Onken M, Riesmeier J, Lasso A, Pinter C, Fichtinger G, Pieper S, Clunie D, Kikinis R, Fedorov A. DCMQI: an open source library for standardized communication of quantitative image analysis results using DICOM. Cancer research. 2017 Nov 1;77(21):e87-90.
  • [13] Bridge CP, Gorman C, Pieper S, Doyle SW, Lennerz JK, Kalpathy-Cramer J, Clunie DA, Fedorov AY, Herrmann MD. Highdicom: A python library for standardized encoding of image annotations and machine learning model outputs in pathology and radiology. Journal of Digital Imaging. 2022 Aug 22:1-9.