Published February 4, 2026 | Version 1.0.0
Dataset Open

Benchmark Datasets for IPD Reconstruction: Real IPD, Published KM, and Synthetic KM

Description

Overview

This repository contains three benchmarking datasets (Real IPD, Published KM, and Synthetic KM) developed to enable standardized evaluation of Individual Patient Data (IPD) reconstruction methods.

These datasets support the manuscript titled: "A Kaplan–Meier point picker with genetic optimization enables constraint-rich individual patient data reconstruction for multicenter meta-analysis", which is currently under preparation/review.

Dataset Structure

The repository is organized into three main archives:

1. Real IPD Dataset (RealIPD.zip)

Designed for single-curve reconstruction tasks. It integrates participant-level data from TCGA (21 cancer types) and GEO (210 analyses).

  • RIPD_subGroup.csv: Metadata containing summary-statistic constraints (N, E, R) and identifiers.
  • RIPD_markerInfo.csv: Comprehensive annotation file recording raw pixel coordinates, axis calibration, and mapped survival data.
  • km_curve_images/: Directory containing the generated KM curve images (indexed by km_curve_id).
  • IPD/: Directory containing the ground-truth IPD CSV files (Columns: id, time, status).
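
As an illustration of how a ground-truth IPD file could be turned back into a KM curve for comparison with a reconstruction, the sketch below loads one CSV and computes the product-limit estimate. It assumes the documented columns (id, time, status) and, as an assumption not stated above, that status 1 marks an event and 0 marks censoring. The function names and the toy cohort are hypothetical.

```python
import csv
from collections import Counter


def load_ipd(path):
    """Read one IPD CSV (columns: id, time, status) into two parallel lists."""
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    times = [float(r["time"]) for r in rows]
    statuses = [int(float(r["status"])) for r in rows]
    return times, statuses


def kaplan_meier(times, statuses):
    """Return [(t, S(t))] at each distinct event time.

    Assumption: status 1 = event, 0 = censored.
    """
    events = Counter(t for t, s in zip(times, statuses) if s == 1)
    exits = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve


# Hypothetical toy cohort, not taken from the dataset.
toy_times = [2, 3, 3, 5, 8]
toy_status = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(toy_times, toy_status):
    print(f"t={t}: S(t)={s:.3f}")
```

Patients censored at the same time as an event are kept in the risk set for that event, which is the usual KM convention.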

2. Published KM Dataset (PublishedKM.zip)

Derived from 280 studies (RCTs, meta-analyses) containing paired curves (Treatment vs. Control).

  • Note: Due to copyright restrictions, original KM images are NOT included. Instead, we provide extracted numerical data and metadata.
  • RA_KM.csv: Study-level metadata (Article titles, reported HRs, 95% CIs).
  • RA_subGroup.csv: Cohort-specific statistics (N, E, R, median survival times, 1-5 year survival rates).
  • RA_markerInfo.csv: Annotation data extracted from KM images. Uses 'km_curve_id' to identify individual curves within a KM image.

3. Synthetic KM Dataset (SyntheticKM.zip)

A fully controlled benchmarking environment generated via Monte Carlo simulations.

  • SA_KM.csv & SA_subGroup.csv: Follow the same schemas as RA_KM.csv and RA_subGroup.csv in the Published KM dataset.
  • SA_markerInfo.csv: Ground-truth coordinates for synthetic curves.
  • km_paired_images/: Composite KM images (paired arms).
  • km_curve_images/: Separated single-arm images.

Data Dictionary and Variable Specifications

Variable definitions for RIPD_subGroup.csv

| Variable | Label | Description |
| --- | --- | --- |
| km_curve_id | KM curve ID | Unique identifier for the KM curve image, where each image contains a single KM curve. This ID corresponds to the source image file in the km_curve_images/ directory. |
| group | Group label | Identifier for the specific study cohort or dataset series (e.g., TCGA-LUAD or GSE10300_RFS) from which the KM curve and its corresponding IPD were obtained. |
| sample_size | Sample size ($N$) | Total number of patients in the KM curve. |
| number_of_events | Number of events ($E$) | Total number of patients with observed events. |
| at_risk_table | At-risk table ($R$) | JSON-formatted string of time points and corresponding number at risk (format: [[time, number], ...]). |
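
Because at_risk_table is stored as a JSON string, it has to be decoded before use. The following is a minimal sketch of one way to do that. The helper names and the example cell are hypothetical, and treating empty or "NULL" cells as missing is an assumption carried over from the NULL convention documented for RA_subGroup.csv.

```python
import json


def parse_at_risk(cell):
    """Decode an at_risk_table cell into a list of (time, number_at_risk) tuples.

    Returns None for missing cells (assumed to be empty or the string "NULL").
    """
    if cell is None or cell.strip() in ("", "NULL"):
        return None
    return [(float(t), int(n)) for t, n in json.loads(cell)]


def is_non_increasing(table):
    """Check that the number at risk never grows over time."""
    return all(a[1] >= b[1] for a, b in zip(table, table[1:]))


# Hypothetical cell in the documented [[time, number], ...] format.
cell = "[[0, 120], [12, 85], [24, 40]]"
table = parse_at_risk(cell)
print(table)
print(is_non_increasing(table))
```

A monotonicity check like this is a cheap way to catch transcription errors before the table is used as a reconstruction constraint.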

Variable definitions for RIPD_markerInfo.csv

| Variable | Label | Description |
| --- | --- | --- |
| annotation_id | Annotation ID | Unique identifier for the annotation record of a KM curve. |
| key_point_coordinates | Marked curve coordinates | Array of $(x, y)$ pixel coordinates representing the marked breakpoints and the end point of the KM curve, mapped onto a standard $1600 \times 800$ canvas. |
| censoring_time_coordinates | Censoring time coordinates | Array of $(x, y)$ pixel coordinates representing the ticks of marked censoring times, normalized to a $1600 \times 800$ canvas. |
| survival_probabilities | Survival probability | Array of survival probabilities ($\boldsymbol{p}$) corresponding to the starting point and breakpoints, mapped to the data coordinate system (range $[0, 1]$). |
| time_values | Time array | Array of time values corresponding to the starting point and breakpoints, mapped to the data coordinate system (actual time units). |
| end_time | End time | The time value corresponding to the end point ($t_M$) of the KM curve. |
| min_censoring_counts | Censoring counts ($\boldsymbol{c}^{min}$) | Array of counts of user-marked censoring times within each interval defined by the successive start, break, and end points. |
| coordinate_origin | Coordinate origin | Pixel coordinates in the normalized image coordinate system representing the axes origin $(0, 0)$ of the data coordinate system. |
| axis_limit_coordinates | Axis limit coordinate | Pixel coordinates representing the point $(time\_axis\_max\_scale, 1.0)$ in the data coordinate system, used to establish the mapping. |
| origin_data_values | Origin data values | The numerical values in the data coordinate system corresponding to the origin of the data plot axes $(0, 0)$. |
| axis_limit_data_values | Axis limit data values | The numerical values corresponding to the maximum identifiable Time axis tick and the maximum Survival probability axis tick ($1.0$). |
| km_image_id | KM image ID | Unique identifier linking this annotation record to the corresponding record in RIPD_subGroup.csv and the source image in the km_curve_images/ directory. |
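
The four calibration fields (coordinate_origin, axis_limit_coordinates, origin_data_values, axis_limit_data_values) give two reference points known in both pixel and data coordinates. Assuming both axes are linear, that is enough to define the pixel-to-data mapping. The sketch below shows one way to write it. It assumes the fields have already been parsed into numeric pairs (the on-disk encoding is not specified here), and the function name and calibration numbers are hypothetical.

```python
def make_pixel_to_data(origin_px, limit_px, origin_val, limit_val):
    """Build a function mapping canvas pixels (x, y) to (time, survival).

    origin_px  : pixel position of the data origin      (coordinate_origin)
    limit_px   : pixel position of the axis-limit point (axis_limit_coordinates)
    origin_val : data values at the origin              (origin_data_values)
    limit_val  : data values at the axis-limit point    (axis_limit_data_values)
    Assumption: both axes are linear.
    """
    (x0, y0), (x1, y1) = origin_px, limit_px
    (t0, s0), (t1, s1) = origin_val, limit_val

    def to_data(x, y):
        t = t0 + (x - x0) * (t1 - t0) / (x1 - x0)
        s = s0 + (y - y0) * (s1 - s0) / (y1 - y0)
        return t, s

    return to_data


# Hypothetical calibration on a 1600 x 800 canvas, not taken from the dataset.
to_data = make_pixel_to_data(
    origin_px=(200, 700), limit_px=(1400, 100),
    origin_val=(0.0, 0.0), limit_val=(60.0, 1.0),
)
print(to_data(800, 400))
```

Interpolating between the two calibration points handles the downward-pointing pixel y-axis automatically, since the sign of the y scale comes from the calibration itself.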

Variable definitions for RA_KM.csv

| Variable | Label | Description |
| --- | --- | --- |
| km_image_id | KM image ID | Unique identifier for the source KM image, where each image contains a pair of KM curves. |
| source_study_title | Source study title | Full title of the source study from which the KM image was extracted. |
| hr | Hazard ratio (HR) | The hazard ratio (HR) reported in the source study specifically corresponding to this KM curve image. |
| hr_ci_lower | HR lower CI | Reported lower bound of the 95% confidence interval for the hazard ratio, specifically corresponding to this KM curve image. |
| hr_ci_upper | HR upper CI | Reported upper bound of the 95% confidence interval for the hazard ratio, specifically corresponding to this KM curve image. |
| reference_group | Reference group | Label of the reference group (exactly matches the group column in RA_subGroup.csv), serving as the denominator in the calculation of the hazard ratio (HR). |
| experimental_group | Experimental group | Label of the experimental group (exactly matches the group column in RA_subGroup.csv), serving as the numerator in the calculation of the hazard ratio (HR). |
| event_available | Availability of number of events ($E$) | Binary indicator (1 = available; 0 = unavailable) representing whether the total number of events ($E$) is reported in the source study. |
| at_risk_available | Availability of at-risk table ($R$) | Binary indicator (1 = available; 0 = unavailable) representing whether the at-risk table ($R$) is reported in the source study. |
| censoring_available | Availability of censoring times ($C$) | Binary indicator (1 = available; 0 = unavailable) representing whether censoring times ($C$) are marked on the KM curve image. |
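
When the reported HR and its 95% CI are compared with an HR estimated from reconstructed IPD, it is often convenient to work on the log scale. The sketch below recovers the log HR and its standard error from hr, hr_ci_lower, and hr_ci_upper using the standard relation $SE = (\ln U - \ln L) / (2 z)$. This assumes the reported interval is a symmetric Wald interval on the log scale, which the source does not state. The function name and example values are hypothetical.

```python
import math
from statistics import NormalDist


def log_hr_and_se(hr, ci_lower, ci_upper, level=0.95):
    """Return (log HR, standard error of log HR) from a reported HR and CI.

    Assumption: the CI is a symmetric Wald interval on the log-HR scale.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    log_hr = math.log(hr)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return log_hr, se


# Hypothetical values, not taken from RA_KM.csv.
log_hr, se = log_hr_and_se(hr=0.75, ci_lower=0.60, ci_upper=0.94)
print(f"log HR = {log_hr:.4f}, SE = {se:.4f}")
```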

Variable definitions for RA_subGroup.csv

| Variable | Label | Description |
| --- | --- | --- |
| km_curve_id | KM curve ID | Unique identifier for each KM curve within a KM image, used to distinguish between the pair of curves present in the image. |
| km_image_id | KM image ID | Identifier linking to the source KM image record in RA_KM.csv. |
| group | Group label | Identifier for the specific treatment arm, matching either the reference_group or experimental_group field in the corresponding RA_KM.csv file. |
| sample_size | Sample size ($N$) | Total number of patients in the group. |
| number_of_events | Number of events ($E$) | Total number of patients with observed events in the group. Recorded as NULL if unavailable. |
| at_risk_table | At-risk table ($R$) | JSON-formatted string of time points and corresponding number at risk (format: [[time, number], ...]). Recorded as NULL if unavailable. |
| median_survival_time | Median survival time | String containing reported median survival time and 95% CI. Recorded as NULL if unavailable. |
| one_year_survival_rate ... five_year_survival_rate | 1-year to 5-year survival rates | Five separate variables recording the survival probability ($S(t)$) at years 1, 2, 3, 4, and 5. Recorded as NULL if unavailable. |
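
The tables above describe how the two files link: RA_subGroup.csv rows join to RA_KM.csv on km_image_id, and the group label identifies each arm as reference or experimental. A minimal sketch of that join using only the standard library follows. The inline rows, ID formats, and function names are hypothetical stand-ins for the real files.

```python
import csv
import io

# Hypothetical rows that mimic the documented columns; not real dataset content.
RA_KM_TEXT = """km_image_id,hr,reference_group,experimental_group
IMG001,0.75,Control,Treatment
"""
RA_SUBGROUP_TEXT = """km_curve_id,km_image_id,group,sample_size
1,IMG001,Treatment,150
2,IMG001,Control,148
"""


def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def pair_arms(km_rows, subgroup_rows):
    """Map km_image_id -> {"reference": row, "experimental": row}."""
    by_image = {}
    for row in subgroup_rows:
        by_image.setdefault(row["km_image_id"], []).append(row)
    pairs = {}
    for study in km_rows:
        arms = {a["group"]: a for a in by_image.get(study["km_image_id"], [])}
        pairs[study["km_image_id"]] = {
            "reference": arms.get(study["reference_group"]),
            "experimental": arms.get(study["experimental_group"]),
        }
    return pairs


pairs = pair_arms(read_rows(RA_KM_TEXT), read_rows(RA_SUBGROUP_TEXT))
print(pairs["IMG001"]["experimental"]["sample_size"])
```

With the real files, `read_rows` would be replaced by opening each CSV from the extracted archive. A missing arm shows up as None rather than raising an error, which makes unmatched labels easy to spot.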

Citation

Since the associated manuscript is currently unpublished, please cite this dataset using the Zenodo DOI provided on this page.

License
Creative Commons Attribution 4.0 International (CC BY 4.0)

Files

Total size: 93.2 MB

| Checksum | Size |
| --- | --- |
| md5:6c50c654876fdeb99b052d7bdc431218 | 777.7 kB |
| md5:588b4eb567f5f6a25ffa56b476740fd6 | 11.1 MB |
| md5:0ea85e9bd115335501e72e2d1e361efa | 81.3 MB |

Additional details

Funding

National Natural Science Foundation of China
T2495250

Dates

Issued
2026-02-04