CIRDataset: A large-scale Dataset for Clinically-Interpretable lung nodule Radiomics and malignancy prediction

Choi, Wookjin; Dahiya, Navdeep; Nadeem, Saad

doi:10.5281/zenodo.6762573

Published June 27, 2022 | Version v2

Conference paper | Open access

1. Thomas Jefferson University Hospital
2. Georgia Institute of Technology
3. Memorial Sloan Kettering Cancer Center

We release 956 radiologist QA/QC'ed spiculation/lobulation annotations on segmented lung nodules for two public datasets: LIDC (with visual radiologist malignancy (RM) scores for the entire cohort and pathology-proven malignancy (PM) labels for a subset) and LUNGx (with pathology-proven, size-matched benign/malignant nodules to remove the effect of size on malignancy prediction). We also release our multi-class Voxel2Mesh extension (available on our Clinically-Interpretable Radiomics GitHub) to provide a good baseline for end-to-end deep learning lung nodule segmentation, peak classification (lobulation/spiculation), and malignancy prediction. To our knowledge, Voxel2Mesh is the only published method that preserves sharp peaks during segmentation, which is why we use it as our base model.

The primary motivation for this work comes from our collaborators in radiology, who asked how important clinically-reported LUNG-RADS features such as spiculation/lobulation are in state-of-the-art deep learning malignancy prediction methods. Previous methods have performed malignancy prediction on the LIDC and LUNGx datasets, but without robust attribution to any clinically reported/actionable features (see the extensive literature on the sensitivity of attribution methods to hyperparameters). This motivated us to annotate clinically-reported features at the voxel/vertex level on public lung nodule datasets and to relate them to malignancy prediction, bypassing the “flaky” attribution schemes. The annotations are produced with our negative area distortion metric, computed via spherical parameterization, which marks spiculations/lobulations on meshes; radiologist QA/QC follows. With the release of this comprehensively annotated dataset, we hope that previous malignancy prediction methods can also validate their explanations and provide clinically actionable insights. We also release our entire pipeline for generating the spiculation/lobulation annotations from scratch, both for LIDC/LUNGx and for new datasets.
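The description names a negative area distortion metric computed via spherical parameterization but does not define it here. As a rough illustration only, the sketch below assumes one common formulation: for each vertex, the log of the ratio between the summed areas of its incident triangles on the sphere and on the original mesh. The function names, the input layout, and the formula itself are assumptions for illustration, not the released pipeline, and the spherical parameterization is taken as a given input rather than computed.

```python
import math


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the edge cross product."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def vertex_area_distortion(original, mapped, faces):
    """Per-vertex log area ratio (assumed definition, see lead-in).

    original: list of (x, y, z) vertex positions on the nodule mesh
    mapped:   positions of the same vertices after spherical parameterization
    faces:    list of (i, j, k) vertex-index triples

    Assumes every vertex belongs to at least one non-degenerate triangle.
    """
    n = len(original)
    area_original = [0.0] * n
    area_mapped = [0.0] * n
    for i, j, k in faces:
        a_orig = triangle_area(original[i], original[j], original[k])
        a_map = triangle_area(mapped[i], mapped[j], mapped[k])
        # Accumulate each triangle's area on all three of its vertices.
        for vertex in (i, j, k):
            area_original[vertex] += a_orig
            area_mapped[vertex] += a_map
    return [math.log(area_mapped[v] / area_original[v]) for v in range(n)]
```

Under this assumed definition, a vertex whose neighbourhood shrinks when mapped to the sphere gets a negative value, which is the kind of signal a peak such as a spiculation would be expected to produce.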

Notes

Accompanying GitHub repository is available here: https://github.com/nadeemlab/CIR.

Files (2.1 GB)

Name	MD5	Size
CIRDataset_LCSR.tar.bz2	87f64bf8727343abdbb7ffbb836e4768	502.7 MB
CIRDataset_npy_for_cnn.tar.bz2	9444fdeb008d9ef18fbcf99b28798401	230.2 MB
CIRDataset_pickle_for_voxel2mesh.tar.bz2	6533ff009ec613fc5f55731758ddcdba	412.3 MB
pretrained_model-mesh+encoder.tar.bz2	e3befd7103174d8a25c69ce3f710ee42	472.8 MB
pretrained_model-meshonly.tar.bz2	3cb42b1e8cf2d08dd8746bef2777578a	525.6 MB
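
Each archive is listed with an MD5 checksum, so a download can be checked before it is unpacked. The following is a minimal sketch using only the Python standard library; the checksums are copied from the listing above, while the helper names (`md5_of_file`, `verify_archive`) are hypothetical and not part of the released code.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "CIRDataset_LCSR.tar.bz2": "87f64bf8727343abdbb7ffbb836e4768",
    "CIRDataset_npy_for_cnn.tar.bz2": "9444fdeb008d9ef18fbcf99b28798401",
    "CIRDataset_pickle_for_voxel2mesh.tar.bz2": "6533ff009ec613fc5f55731758ddcdba",
    "pretrained_model-mesh+encoder.tar.bz2": "e3befd7103174d8a25c69ce3f710ee42",
    "pretrained_model-meshonly.tar.bz2": "3cb42b1e8cf2d08dd8746bef2777578a",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Return True if the file's MD5 matches the checksum listed for its name.

    Raises KeyError when the file name is not one of the listed archives.
    """
    path = Path(path)
    if path.name not in EXPECTED_MD5:
        raise KeyError(f"no checksum recorded for {path.name}")
    return md5_of_file(path) == EXPECTED_MD5[path.name]
```

For example, `verify_archive("CIRDataset_LCSR.tar.bz2")` should return `True` for an intact download in the current directory. A `False` result suggests a truncated or corrupted file that is worth downloading again.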

	All versions	This version
Views	1,650	1,361
Downloads	1,027	939
Data volume	741.6 GB	708.6 GB
