Published March 28, 2024 | Version v1
Dataset Open

CPAISD: Core-Penumbra Acute Ischemic Stroke Dataset

  • 1. Sber AI Lab
  • 2. City hospital 40 of the Saint Petersburg Resort district

Description

The dataset contains 112 non-contrast cranial CT scans of patients with hyperacute stroke, with the penumbra and core zones of the stroke delineated on each slice where present. The data were anonymized with the Kitware DicomAnonymizer using its standard settings, except that the values of the following fields were preserved:

  • (0x0010, 0x0040) – Patient's Sex
  • (0x0010, 0x1010) – Patient's Age
  • (0x0008, 0x0070) – Manufacturer
  • (0x0008, 0x1090) – Manufacturer's Model Name

The patient's sex and age are retained for demographic analysis of the sample. The equipment manufacturer and model are retained for dataset statistics and for possible domain shift analysis.

The dataset is split into three folds:

  • Training fold (92 studies, 8,376 slices).
  • Validation fold (10 studies, 980 slices).
  • Testing fold (10 studies, 809 slices).
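
As a quick sanity check, the fold sizes listed above can be summed and turned into proportions. The snippet below is only an illustrative sketch: the numbers are copied from the list, and the variable names are chosen here for the example.

```python
# Fold sizes copied from the list above: (studies, slices).
FOLDS = {
    "train": (92, 8376),
    "val": (10, 980),
    "test": (10, 809),
}

total_studies = sum(studies for studies, _ in FOLDS.values())
total_slices = sum(slices for _, slices in FOLDS.values())

for name, (studies, slices) in FOLDS.items():
    # Share of each fold, by study count and by slice count.
    print(
        f"{name}: {studies / total_studies:.1%} of studies, "
        f"{slices / total_slices:.1%} of slices"
    )
```

The study counts add up to the 112 scans stated in the description.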

The dataset has the following structure:

  • metadata.json – dataset metadata
  • summary.csv – metadata of each study in a CSV format table
  • One folder per part of the dataset (train, val, and test)
    • Study
      • Slice
        • raw.dcm – original slice file
        • image.npz – slice in Numpy array format
        • mask.npz – segmentation mask in Numpy array format
        • metadata.json – slice metadata in JSON format
      • metadata.json – study metadata in JSON format
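
The layout above could be traversed roughly as follows. This is a minimal sketch, not the authors' loader: it assumes the part folders are named train, val, and test, that every subfolder of a study is a slice, and that each .npz archive holds a single array. The array key inside the archives is not documented here, so the code takes the first stored array instead of hard-coding a name. The function names are hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple

import numpy as np


def iter_slice_dirs(root: Path, part: str) -> Iterator[Tuple[str, Path]]:
    """Yield (study_name, slice_dir) for every slice folder in one part."""
    for study_dir in sorted((root / part).iterdir()):
        if not study_dir.is_dir():
            continue
        for slice_dir in sorted(study_dir.iterdir()):
            # The study-level metadata.json is a file, so it is skipped here.
            if slice_dir.is_dir():
                yield study_dir.name, slice_dir


def load_first_array(npz_path: Path) -> np.ndarray:
    """Return the first array stored in an .npz archive.

    ASSUMPTION: each archive contains exactly one array of interest.
    """
    with np.load(str(npz_path)) as archive:
        return archive[archive.files[0]]


def load_slice(slice_dir: Path) -> Tuple[np.ndarray, np.ndarray]:
    """Load the image and its segmentation mask for one slice folder."""
    image = load_first_array(slice_dir / "image.npz")
    mask = load_first_array(slice_dir / "mask.npz")
    return image, mask
```

Taking the first array keeps the sketch independent of the archive's internal key, at the cost of silently ignoring any further arrays that might be stored.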

The metadata.json at the root of the dataset has the following format:

  • generation_params – dataset generation parameters:
    • test_size – proportion of the test part
    • val_size – proportion of the validation part
  • stats – statistical data:
    • common – general statistical data:
      • train_size_in_studies – number of studies in the training part of the dataset.
      • train_size_in_images – number of slices in the training part of the dataset.
      • val_size_in_studies – number of studies in the validation part of the dataset.
      • val_size_in_images – number of slices in the validation part of the dataset.
      • test_size_in_studies – number of studies in the test part of the dataset.
      • test_size_in_images – number of slices in the test part of the dataset.
    • train – statistical data for the training part of the dataset:
      • min – minimum pixel value.
      • max – maximum pixel value.
      • mean – average pixel value.
      • std – standard deviation for all pixel values.
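
One plausible use of the training statistics is to standardize pixel values before feeding them to a model. The sketch below assumes the JSON nesting matches the listing above (stats, then train, then mean and std); the function names are hypothetical, and the source does not state that the authors normalize the data this way.

```python
import json
from pathlib import Path


def load_train_stats(root: Path) -> dict:
    """Read the 'train' statistics block from the root metadata.json."""
    with open(root / "metadata.json", encoding="utf-8") as fh:
        meta = json.load(fh)
    # ASSUMPTION: nesting follows the listing above (stats -> train).
    return meta["stats"]["train"]


def standardize(value, stats: dict):
    """Shift by the training mean and scale by the training std."""
    return (value - stats["mean"]) / stats["std"]
```

Because standardize uses only subtraction and division, it also works unchanged on a NumPy array of pixel values.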

The metadata.json at the root of the study has the following format. If a field value is unknown, it is given as 'unknown':

  • manufacturer – manufacturer of the tomograph.
  • model – model of the tomograph.
  • device – full name of the tomograph (manufacturer + model).
  • age – patient's age in years.
  • sex – patient's sex. M – male, F – female.
  • dsa – whether cerebral angiography was performed. true if yes, false if no.
  • nihss – NIHSS score.
  • time – time in hours from stroke onset to the time the study was performed. Can be either a number or a range.
  • lethality – whether the person died as a result of this stroke. true if yes, false if no.
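
Since time can be a number, a range, or 'unknown', code that consumes it needs to handle all three cases. The sketch below shows one way to do that. The textual form of a range is not specified here, so the "low-high" string format (for example "3-6") is purely an assumption made for illustration, and parse_time is a hypothetical helper.

```python
from typing import Optional, Tuple, Union


def parse_time(value: Union[int, float, str]) -> Optional[Tuple[float, float]]:
    """Return (low, high) in hours, or None when the value is 'unknown'.

    A single number is returned as a degenerate range (x, x).
    """
    if isinstance(value, (int, float)):
        return float(value), float(value)
    text = str(value).strip()
    if text == "unknown":
        return None
    if "-" in text:
        # ASSUMPTION: ranges are written as "low-high"; the actual
        # encoding used in the dataset may differ.
        low, high = text.split("-", 1)
        return float(low), float(high)
    return float(text), float(text)
```

Returning a pair in every known case lets downstream code treat exact times and ranges uniformly, for instance by comparing the upper bound against a treatment window.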

The summary.csv contains the same fields as the metadata.json at the root of the study, plus two additional fields:

  • name – name of the study.
  • part – part of the dataset in which the study is located.
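
The per-study table can be read with the standard csv module. The sketch below counts studies per part using the part column described above. It assumes a comma-delimited file with a header row, and studies_per_part is a hypothetical helper name.

```python
import csv
from collections import Counter
from pathlib import Path


def studies_per_part(summary_path: Path) -> Counter:
    """Count the rows of summary.csv for each value of the 'part' column."""
    # ASSUMPTION: comma-delimited with a header row containing 'part'.
    with open(summary_path, newline="", encoding="utf-8") as fh:
        return Counter(row["part"] for row in csv.DictReader(fh))
```

The resulting counts can be compared with the fold sizes given in the description to confirm that the archive was extracted completely.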

Files

dataset.zip (5.6 GB)
md5:a034be2cc6e93b5fb696231c59df900c
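
The published MD5 checksum can be used to confirm that the archive downloaded intact. The following is a minimal sketch using only the standard library; the local file name and the helper name md5_of_file are assumptions for the example.

```python
import hashlib
from pathlib import Path

# Checksum as published on this page.
EXPECTED_MD5 = "a034be2cc6e93b5fb696231c59df900c"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Chunked reads keep memory use flat for a multi-gigabyte archive.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download would then be checked with md5_of_file(Path("dataset.zip")) == EXPECTED_MD5, assuming the archive was saved under that name in the current directory.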

Additional details

Software

Repository URL
https://github.com/sb-ai-lab/early_hyperacute_stroke_dataset
Programming language
Python
Development Status
Active