MultiCaRe: An open-source clinical case dataset for medical image classification and multimodal AI applications
Description
The dataset contains multimodal data from over 70,000 open-access, de-identified case reports, including metadata, clinical cases, image captions and more than 130,000 images. Images and clinical cases span different medical specialties, such as oncology, cardiology, surgery and pathology. The structure of the dataset makes it easy to map each image to its corresponding article metadata, clinical case, caption and image labels. Details of the data structure can be found in the file data_dictionary.csv.
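A minimal sketch of inspecting data_dictionary.csv before working with the other files. The column layout of the dictionary is not described on this page, so the helper below makes no assumption about column names and simply reports whatever header the file has. The function name `load_data_dictionary` and the local path are illustrative assumptions, not part of the dataset's tooling.

```python
import csv
from pathlib import Path


def load_data_dictionary(path):
    """Read a CSV file and return (header, rows), with each row as a dict keyed by the header."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        header = list(reader.fieldnames or [])
    return header, rows


if __name__ == "__main__":
    # Assumed local path: adjust to wherever the file was downloaded.
    dictionary_path = Path("data_dictionary.csv")
    if dictionary_path.exists():
        header, rows = load_data_dictionary(dictionary_path)
        print("Columns:", header)
        print("Entries:", len(rows))
```

Printing the header first is a cheap way to learn which fields the dictionary documents before loading the larger files.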
More than 90,000 patients and 280,000 medical doctors and researchers were involved in the creation of the articles included in this dataset. The citation data of each article can be found in the metadata.parquet file.
Refer to the examples in the project's GitHub repository (https://github.com/mauro-nievoff/MultiCaRe_Dataset) to learn how to make the most of this dataset.
The dataset as a whole is licensed under CC BY-NC-SA. However, its individual contents may carry less restrictive licenses (CC BY, CC BY-NC, CC0). For instance, among the image files, about 66K are CC BY, 32K are CC BY-NC-SA, 32K are CC BY-NC, and 20 are CC0.
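The license breakdown above can be turned into a quick sanity check. The sketch below hard-codes the rounded counts quoted in this description, so the totals are approximate, and sums the images whose license has no NonCommercial clause. The variable names are illustrative only; per-image license information should be taken from the dataset files themselves.

```python
# Approximate image counts per license, as quoted in the description
# (the "K" figures are rounded to the nearest thousand).
image_license_counts = {
    "CC BY": 66_000,
    "CC BY-NC-SA": 32_000,
    "CC BY-NC": 32_000,
    "CC0": 20,
}

total_images = sum(image_license_counts.values())

# Keep only licenses without a NonCommercial ("NC") clause.
without_nc = {
    name: count for name, count in image_license_counts.items() if "NC" not in name
}
without_nc_total = sum(without_nc.values())

print(f"Total images (approx.): {total_images}")
print(
    f"Images without an NC clause (approx.): {without_nc_total} "
    f"({without_nc_total / total_images:.0%})"
)
```

The approximate total agrees with the "more than 130,000 images" figure given in the description.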
Files
data_dictionary.csv
Total size: 2.9 GB

MD5 checksum | Size |
---|---|
md5:6147674303929e5acc9e8986e747ea34 | 45.0 MB |
md5:a5f8921be1eadc0072795385e3b6180e | 49.2 MB |
md5:e9800f71512a2cfc6cabc659ab5ba725 | 52.0 MB |
md5:4574baaecf1ab2e5441353bdc8d09f51 | 158.7 MB |
md5:2de670f0f631189192835ee17830c4e3 | 6.4 kB |
md5:59f484571767752f3546c1b575bfffda | 19.3 MB |
md5:39bdafd6e078d48231d72418530ee5a6 | 779.5 MB |
md5:a4fb23b0677937cbbe21fc88d1cc66f3 | 57.1 MB |
md5:1987c585d627330a22209ef86bf51c5a | 310.0 MB |
md5:a5e9363ecd5e74eae0ff515cfa9bcc71 | 327.0 MB |
md5:0885722cb61926835d43a6927bdc273f | 266.9 MB |
md5:161233a58744c46c328702374ae8827d | 278.1 MB |
md5:55fd574b7324c77c14d5a6e80df0ecfa | 226.9 MB |
md5:6290db3092376daf83e1ce9b6e5493a8 | 241.1 MB |
md5:42683f0550d8ae98e1a901b4bdc336cc | 57.0 MB |
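The MD5 checksums listed for each file can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the caller is responsible for pairing each downloaded file with the checksum shown for it in the file list.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks by default."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or plain '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for the larger archives in the list. A call such as `matches_checksum(local_path, "md5:...")` returns `True` only when the local file is byte-for-byte identical to the published one.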
Additional details
Related works
- Is published in: Data paper, DOI 10.1016/j.dib.2023.110008
Software
- Repository URL: https://github.com/mauro-nievoff/MultiCaRe_Dataset
- Programming language: Python
- Development status: Active