Chest Radiograph at Diverse Institutes (CRADI) dataset
Authors/Creators
- 1. Winning Health Technology Group Ltd
- 2. Shanghai General Hospital
Description
Introduction to the Chest Radiograph at Diverse Institutes (CRADI) dataset
Background
Chest radiography is extensively used to screen and diagnose pulmonary and cardiac diseases. The advantages of its clinical practicality, efficiency, and cost-effectiveness make chest radiography the most accessible imaging test for pulmonary disorders, especially in primary hospitals. Currently, the interpretation of a chest radiograph mainly relies on radiologists.
With advances in deep learning, convolutional neural networks (CNNs) have shown the ability to detect individual disorders on chest radiographs, e.g., pneumothorax, lung cancer, pneumonia, and tuberculosis. Training a CNN model to classify medical images has traditionally relied on expert annotation, a manual labeling procedure that is time-consuming and highly demanding. Importantly, beyond the detection of a single disease, multi-label classification is necessary to interpret a chest radiograph in clinical practice.
To promote the development of artificial intelligence-assisted diagnosis in chest radiography, we launched the Chest Radiograph at Diverse Institutes (CRADI) dataset, which comprises a large number of chest radiographs. Each radiograph carries a 25-label disorder annotation. The label terms were adopted from the Fleischner Society glossary, and the labels were extracted from the original diagnostic reports by natural language processing (NLP) combined with radiologist expertise.
At present, the CRADI data come from two academic hospitals and multiple community clinics in Shanghai. The cases include in-patients, out-patients, and screening participants.
By bringing together several distinct clinical data sources for chest radiography, the CRADI dataset is potentially helpful for training and testing CNN models.
We welcome more data into the CRADI dataset. If you find it helpful to your research work or you want to contribute to this dataset, please feel free to contact us.
Data source
| Source | Number of images | Data_source |
|---|---|---|
| Academic hospital 1 | 74,082 | 0 |
| In- and out-patients from Academic hospital 2 | 5,996 | 1 |
| Screening participants from Academic hospital 2 | 2,130 | 2 |
| Community clinics | 1,804 | 3 |
Note. Each case includes one posteroanterior (PA) view chest radiograph and the corresponding text labels of disorder findings.
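As a quick check on how the sources are balanced, the counts in the table above can be summed and expressed as shares of the whole. This is only a small arithmetic sketch; the dictionary name `IMAGE_COUNTS` is chosen here for illustration and is not part of the dataset.

```python
# Image counts per source, copied from the data source table.
IMAGE_COUNTS = {
    "Academic hospital 1": 74082,
    "In- and out-patients from Academic hospital 2": 5996,
    "Screening participants from Academic hospital 2": 2130,
    "Community clinics": 1804,
}

# Total number of images across all four sources (84012).
total = sum(IMAGE_COUNTS.values())

# Percentage of the dataset contributed by each source, to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in IMAGE_COUNTS.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

Most images come from Academic hospital 1, so per-source evaluation may be more informative than a single pooled score.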
Preprocessing methods
Images: converted from DICOM (DCM) format to PNG, resized, and reduced from 12-bit to 8-bit grayscale. All patient- and institute-related information has been removed for de-identification.
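The bit-depth reduction can be sketched as follows. The record does not say how 12-bit values were mapped to 8 bits, so this minimal example assumes a plain linear rescale of the full 12-bit range (0 to 4095) onto 0 to 255; intensity windowing or per-image min-max scaling are equally plausible and would give different pixel values. The function name `to_8bit` is hypothetical.

```python
def to_8bit(pixels, max_in=4095):
    """Linearly rescale grayscale values in 0..max_in to 0..255.

    Assumption: a plain linear mapping. The method actually used to
    build the dataset is not documented.
    """
    out = []
    for row in pixels:
        new_row = []
        for value in row:
            value = min(max(value, 0), max_in)  # clamp out-of-range values
            # Integer arithmetic with rounding to the nearest 8-bit level.
            new_row.append((value * 255 + max_in // 2) // max_in)
        out.append(new_row)
    return out


# A tiny 2x2 "image" with 12-bit intensities.
image_12bit = [[0, 2048], [4095, 1000]]
image_8bit = to_8bit(image_12bit)
print(image_8bit)
```

A conversion like this is lossy: sixteen neighboring 12-bit levels collapse into one 8-bit level, so subtle density differences in the original DICOM files are not recoverable from the PNG images.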
Label: labels are extracted from the original diagnostic reports using NLP and rule-based methods built on regular expressions. In total, 25 labels are extracted for each image.
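To illustrate what rule-based extraction with regular expressions can look like, here is a minimal sketch. The patterns, the negation rule, and the function name `extract_labels` are all hypothetical and written for English text; the dataset's actual rules and the language of the source reports are not given in this record.

```python
import re

# Hypothetical patterns for three of the 25 labels (illustration only).
LABEL_PATTERNS = {
    "pneumothorax": re.compile(r"\bpneumothorax\b", re.IGNORECASE),
    "pleural effusion": re.compile(r"\bpleural effusions?\b", re.IGNORECASE),
    "cardiomegaly": re.compile(
        r"\b(cardiomegaly|enlarged (heart|cardiac silhouette))\b", re.IGNORECASE
    ),
}

# Crude negation cue: "no" or "without" earlier in the same sentence.
NEGATION = re.compile(r"\b(no|without)\b[^.]*$", re.IGNORECASE)


def extract_labels(report):
    """Return a {label: 0 or 1} dict for one free-text report."""
    labels = {name: 0 for name in LABEL_PATTERNS}
    for sentence in report.split("."):
        for name, pattern in LABEL_PATTERNS.items():
            match = pattern.search(sentence)
            # Count the finding only if no negation cue precedes it.
            if match and not NEGATION.search(sentence[: match.start()]):
                labels[name] = 1
    return labels


report = "Enlarged cardiac silhouette. Small left pleural effusion. No pneumothorax."
labels = extract_labels(report)
print(labels)
```

Real report language is far more varied (uncertainty, comparisons with prior studies, abbreviations), which is presumably why the extraction is combined with radiologist expertise.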
Data
Overview: all images are compressed into one archive file, and all classification labels are listed in one CSV file. The data are organized as follows:
Data organization
Classification result
Each image is linked to its labels through the 'patientID' field.
The 'data_resource' field gives the source of the data, using the codes listed in the data source table above.
Result table format: |patientID|data_resource|label1|label2|...|label25|
The 25 labels appear in the following order:
1) pneumothorax, 2) emphysema, 3) pulmonary parenchymal calcification, 4) PICC implant, 5) aortic unfolding, 6) aortic arteriosclerosis, 7) aortic abnormalities, 8) small consolidation, 9) cardiomegaly, 10) patchy consolidation, 11) consolidation, 12) cavity, 13) mass, 14) prominent bronchovascular marking, 15) pulmonary edema, 16) pulmonary nodule, 17) hilar adenopathy, 18) pleural effusion, 19) pleural thickening, 20) pleural adhesion, 21) pleural calcification, 22) pleural abnormalities, 23) scoliosis, 24) pacemaker implant, 25) interstitial involvement.
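The row format and label order above can be turned into a small reader for the label CSV. This is a sketch under several assumptions that the record does not confirm: the file is comma-delimited, has no header row, and stores each label as the string "0" or "1". The names `LABEL_NAMES`, `DATA_SOURCES`, and `read_labels` are hypothetical.

```python
import csv
import io

# The 25 labels in the documented column order.
LABEL_NAMES = [
    "pneumothorax", "emphysema", "pulmonary parenchymal calcification",
    "PICC implant", "aortic unfolding", "aortic arteriosclerosis",
    "aortic abnormalities", "small consolidation", "cardiomegaly",
    "patchy consolidation", "consolidation", "cavity", "mass",
    "prominent bronchovascular marking", "pulmonary edema",
    "pulmonary nodule", "hilar adenopathy", "pleural effusion",
    "pleural thickening", "pleural adhesion", "pleural calcification",
    "pleural abnormalities", "scoliosis", "pacemaker implant",
    "interstitial involvement",
]

# Source codes from the data source table.
DATA_SOURCES = {
    0: "Academic hospital 1",
    1: "In- and out-patients from Academic hospital 2",
    2: "Screening participants from Academic hospital 2",
    3: "Community clinics",
}


def read_labels(csv_file):
    """Yield (patient_id, source_name, positive_labels) for each row.

    Assumes no header row and "0"/"1" label values.
    """
    for row in csv.reader(csv_file):
        patient_id, source_code, *flags = row
        if len(flags) != len(LABEL_NAMES):
            raise ValueError(f"expected 25 labels, got {len(flags)}")
        positives = [name for name, flag in zip(LABEL_NAMES, flags) if flag == "1"]
        yield patient_id, DATA_SOURCES[int(source_code)], positives


# Made-up example row: labels 1 and 18 positive, source code 3.
flags = ["0"] * 25
flags[0] = "1"
flags[17] = "1"
sample = io.StringIO(",".join(["P0001", "3"] + flags) + "\n")
rows = list(read_labels(sample))
print(rows)
```

If the real file does contain a header row, skipping the first row before the loop (for example with `next(reader)` on a named reader) would be the only change needed.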
Data source codes
Training data: data source 0
In- and out-patients from the external hospital: data source 1
Screening participants from the external hospital: data source 2
Community clinics: data source 3
Data will be released after the anonymization process.
Files (14.6 GB in total)
CHARDI_test.zip
| Checksum | Size |
|---|---|
| md5:b9b3965faf0ce47101f09fa4512af501 | 7.2 MB |
| md5:d051b30f7f5e1ec3991f92d0f67011b9 | 14.6 GB |