Benchmark Datasets for IPD Reconstruction: Real IPD, Published KM, and Synthetic KM
Description
Overview
This repository contains three benchmarking datasets (Real IPD, Published KM, and Synthetic KM) developed to enable standardized evaluation of Individual Patient Data (IPD) reconstruction methods.
These datasets support the manuscript titled: "A Kaplan–Meier point picker with genetic optimization enables constraint-rich individual patient data reconstruction for multicenter meta-analysis", which is currently under preparation/review.
Dataset Structure
The repository is organized into three main archives:
1. Real IPD Dataset (RealIPD.zip)
Designed for single-curve reconstruction tasks. It integrates participant-level data from TCGA (21 cancer types) and GEO (210 analyses).
- RIPD_subGroup.csv: Metadata containing summary-statistic constraints (N, E, R) and identifiers.
- RIPD_markerInfo.csv: Comprehensive annotation file recording raw pixel coordinates, axis calibration, and mapped survival data.
- km_curve_images/: Directory containing the generated KM curve images (indexed by km_curve_id).
- IPD/: Directory containing the ground-truth IPD CSV files (Columns: id, time, status).
2. Published KM Dataset (PublishedKM.zip)
Derived from 280 studies (RCTs, meta-analyses) containing paired curves (Treatment vs. Control).
- Note: Due to copyright restrictions, original KM images are NOT included. Instead, we provide extracted numerical data and metadata.
- RA_KM.csv: Study-level metadata (Article titles, reported HRs, 95% CIs).
- RA_subGroup.csv: Cohort-specific statistics (N, E, R, median survival times, 1-5 year survival rates).
- RA_markerInfo.csv: Annotation data extracted from KM images. Uses 'km_curve_id' to identify individual curves within a KM image.
3. Synthetic KM Dataset (SyntheticKM.zip)
A fully controlled benchmarking environment generated via Monte Carlo simulations.
- SA_KM.csv & SA_subGroup.csv: Follow the same schemas as RA_KM.csv and RA_subGroup.csv in the Published KM dataset.
- SA_markerInfo.csv: Ground-truth coordinates for synthetic curves.
- km_paired_images/: Composite KM images (paired arms).
- km_curve_images/: Separated single-arm images.
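
For the Real IPD archive, the ground-truth IPD files can be summarized back into the (N, E, R) constraints as a quick consistency check. The following is a minimal sketch, not part of the repository: it assumes status is coded 1 for an event and 0 for censoring (the coding is not stated in this description), it counts a patient as at risk at time t when their follow-up time is at least t, and the helper name summarize_ipd and the sample rows are hypothetical.

```python
import csv
import io

def summarize_ipd(rows, risk_times):
    """Return (N, E, R) for an IPD table with columns id, time, status.

    Assumption: status == 1 marks an event and 0 marks censoring.
    """
    times = [float(r["time"]) for r in rows]
    statuses = [int(r["status"]) for r in rows]
    n = len(rows)
    e = sum(1 for s in statuses if s == 1)
    # Number at risk at t = patients whose follow-up time is >= t.
    at_risk = [[t, sum(1 for x in times if x >= t)] for t in risk_times]
    return n, e, at_risk

# Hypothetical IPD content; the real files live under IPD/ in RealIPD.zip.
sample_csv = """id,time,status
1,2.0,1
2,3.5,0
3,5.0,1
4,7.2,0
5,9.0,1
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
n, e, at_risk = summarize_ipd(rows, risk_times=[0, 3, 6, 9])
print(n, e, at_risk)  # 5 3 [[0, 5], [3, 4], [6, 2], [9, 1]]
```

The returned at-risk list uses the same [[time, number], ...] layout as the at_risk_table field, so the two can be compared directly once the stored string is parsed.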
Data Dictionary and Variable Specifications
Variable definitions for RIPD_subGroup.csv
| Variable | Label | Description |
|---|---|---|
| km_curve_id | KM curve ID | Unique identifier for the KM curve image, where each image contains a single KM curve. This ID corresponds to the source image file in the km_curve_images/ directory. |
| group | Group label | Identifier for the specific study cohort or dataset series (e.g., TCGA-LUAD or GSE10300_RFS) from which the KM curve and its corresponding IPD were obtained. |
| sample_size | Sample size ($N$) | Total number of patients in the KM curve. |
| number_of_events | Number of events ($E$) | Total number of patients with observed events. |
| at_risk_table | At-risk table ($R$) | JSON-formatted string of time points and corresponding number at risk (format: [[time, number], ...]). |
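
Because at_risk_table is stored as a JSON string, it has to be decoded before use. A minimal sketch with Python's standard json module is shown below. The function name parse_at_risk_table and the example value are hypothetical, and treating an empty string or the literal text NULL as missing is an assumption about how unavailable entries appear in the CSV.

```python
import json

def parse_at_risk_table(raw):
    """Parse an at_risk_table string into sorted (time, number) pairs.

    Returns None for missing entries (assumed to be "" or "NULL").
    """
    if raw is None or raw.strip() in ("", "NULL"):
        return None
    pairs = [(float(t), int(n)) for t, n in json.loads(raw)]
    pairs.sort(key=lambda p: p[0])
    # Sanity check: the number at risk should never increase over time.
    for (_, prev), (_, cur) in zip(pairs, pairs[1:]):
        if cur > prev:
            raise ValueError("number at risk increases over time")
    return pairs

example = "[[0, 120], [12, 85], [24, 40], [36, 11]]"  # hypothetical value
table = parse_at_risk_table(example)
print(table)  # [(0.0, 120), (12.0, 85), (24.0, 40), (36.0, 11)]
```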
Variable definitions for RIPD_markerInfo.csv
| Variable | Label | Description |
|---|---|---|
| annotation_id | Annotation ID | Unique identifier for the annotation record of a KM curve. |
| key_point_coordinates | Marked curve coordinates | Array of $(x, y)$ pixel coordinates representing the marked breakpoints and the end point of the KM curve, mapped onto a standard $1600 \times 800$ canvas. |
| censoring_time_coordinates | Censoring time coordinates | Array of $(x, y)$ pixel coordinates representing the ticks of marked censoring times, normalized to a $1600 \times 800$ canvas. |
| survival_probabilities | Survival probability | Array of survival probabilities ($\boldsymbol{p}$) corresponding to the starting point and breakpoints, mapped to the data coordinate system (range $[0, 1]$). |
| time_values | Time array | Array of time values corresponding to the starting point and breakpoints, mapped to the data coordinate system (actual time units). |
| end_time | End time | The time value corresponding to the end point ($t_M$) of the KM curve. |
| min_censoring_counts | Censoring counts ($\boldsymbol{c}^{min}$) | Array of counts of user-marked censoring times within each interval defined by the successive start, break, and end points. |
| coordinate_origin | Coordinate origin | Pixel coordinates in the normalized image coordinate system representing the axes origin $(0, 0)$ of the data coordinate system. |
| axis_limit_coordinates | Axis limit coordinate | Pixel coordinates representing the point $(time\_axis\_max\_scale, 1.0)$ in the data coordinate system, used to establish the mapping. |
| origin_data_values | Origin data values | The numerical values in the data coordinate system corresponding to the origin of the data plot axes $(0, 0)$. |
| axis_limit_data_values | Axis limit data values | The numerical values corresponding to the maximum identifiable Time axis tick and the maximum Survival probability axis tick ($1.0$). |
| km_image_id | KM image ID | Unique identifier linking this annotation record to the metadata in RIPD_subGroup.csv and the source image in the km_curve_images/ directory. |
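
The four calibration fields (coordinate_origin, axis_limit_coordinates, origin_data_values, axis_limit_data_values) give two reference points, which is enough to define a pixel-to-data mapping. The sketch below shows one way this could work, assuming both axes are linear and each field holds an (x, y) or (time, probability) pair. The function name make_pixel_to_data and all numeric values are hypothetical, and this is not necessarily the procedure used to produce time_values and survival_probabilities.

```python
def make_pixel_to_data(origin_px, limit_px, origin_val, limit_val):
    """Build a linear pixel-to-data mapping from two calibration points.

    Assumption: both axes are linear (no log or broken axes).
    """
    (x0, y0), (x1, y1) = origin_px, limit_px
    (t0, p0), (t1, p1) = origin_val, limit_val

    def to_data(point):
        x, y = point
        # Interpolate along each axis; image y usually grows downward,
        # which the ratio below handles because (y1 - y0) is then negative.
        time = t0 + (x - x0) / (x1 - x0) * (t1 - t0)
        prob = p0 + (y - y0) / (y1 - y0) * (p1 - p0)
        return time, prob

    return to_data

# Hypothetical calibration on the 1600 x 800 canvas.
to_data = make_pixel_to_data(
    origin_px=(100, 700), limit_px=(1500, 100),
    origin_val=(0.0, 0.0), limit_val=(60.0, 1.0),
)
time, prob = to_data((800, 400))
print(time, prob)  # 30.0 0.5
```

Applying to_data to each entry of key_point_coordinates or censoring_time_coordinates would yield values in the data coordinate system under the same linearity assumption.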
Variable definitions for RA_KM.csv
| Variable | Label | Description |
|---|---|---|
| km_image_id | KM image ID | Unique identifier for the source KM image, where each image contains a pair of KM curves. |
| source_study_title | Source study title | Full title of the source study from which the KM image was extracted. |
| hr | Hazard ratio (HR) | The hazard ratio (HR) reported in the source study specifically corresponding to this KM curve image. |
| hr_ci_lower | HR lower CI | Reported lower bound of the 95% confidence interval for the hazard ratio, specifically corresponding to this KM curve image. |
| hr_ci_upper | HR upper CI | Reported upper bound of the 95% confidence interval for the hazard ratio, specifically corresponding to this KM curve image. |
| reference_group | Reference group | Label of the reference group (exactly matches the group column in RA_subGroup.csv), serving as the denominator in the calculation of the hazard ratio (HR). |
| experimental_group | Experimental group | Label of the experimental group (exactly matches the group column in RA_subGroup.csv), serving as the numerator in the calculation of the hazard ratio (HR). |
| event_available | Availability of number of events ($E$) | Binary indicator (1 = available; 0 = unavailable) representing whether the total number of events ($E$) is reported in the source study. |
| at_risk_available | Availability of at-risk table ($R$) | Binary indicator (1 = available; 0 = unavailable) representing whether the at-risk table ($R$) is reported in the source study. |
| censoring_available | Availability of censoring times ($C$) | Binary indicator (1 = available; 0 = unavailable) representing whether censoring times ($C$) are marked on the KM curve image. |
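
When the reported hr, hr_ci_lower, and hr_ci_upper values are compared with estimates from reconstructed IPD or pooled in a meta-analysis, they are commonly moved to the log scale first. The sketch below uses the standard approximation SE = (ln(upper) - ln(lower)) / (2 × 1.96), which assumes the reported 95% interval is symmetric on the log scale. The function name log_hr_and_se and the input values are hypothetical.

```python
import math

def log_hr_and_se(hr, ci_lower, ci_upper, z=1.959964):
    """Convert a reported HR and its 95% CI to (log HR, standard error).

    Assumption: the CI is a Wald interval, symmetric on the log scale.
    """
    if min(hr, ci_lower, ci_upper) <= 0:
        raise ValueError("HR and CI bounds must be positive")
    log_hr = math.log(hr)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return log_hr, se

# Hypothetical reported values, not taken from RA_KM.csv.
log_hr, se = log_hr_and_se(hr=0.75, ci_lower=0.60, ci_upper=0.94)
print(round(log_hr, 4), round(se, 4))
```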
Variable definitions for RA_subGroup.csv
| Variable | Label | Description |
|---|---|---|
| km_curve_id | KM curve ID | Unique identifier for each KM curve within a KM image, used to distinguish between the pair of curves present in the image. |
| km_image_id | KM image ID | Identifier linking to the source KM image record in RA_KM.csv. |
| group | Group label | Identifier for the specific treatment arm, matching either the reference_group or experimental_group field in the corresponding RA_KM.csv file. |
| sample_size | Sample size ($N$) | Total number of patients in the group. |
| number_of_events | Number of events ($E$) | Total number of patients with observed events in the group. Recorded as NULL if unavailable. |
| at_risk_table | At-risk table ($R$) | JSON-formatted string of time points and corresponding number at risk (format: [[time, number], ...]). Recorded as NULL if unavailable. |
| median_survival_time | Median survival time | String containing reported median survival time and 95% CI. Recorded as NULL if unavailable. |
| one_year_survival_rate ... five_year_survival_rate | 1-year to 5-year survival rates | Five separate variables recording the survival probability ($S(t)$) at years 1, 2, 3, 4, and 5. Recorded as NULL if unavailable. |
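
The km_image_id and group fields are what tie each curve in RA_subGroup.csv to its study-level record in RA_KM.csv. The following sketch illustrates that join using only the standard library. The CSV excerpts, identifiers such as IMG001 and C1, and the helper names are hypothetical, and it assumes unavailable values appear as the literal text NULL.

```python
import csv
import io

# Hypothetical excerpts that reuse the documented column names.
ra_km_csv = """km_image_id,hr,reference_group,experimental_group
IMG001,0.75,Control,Treatment
"""
ra_subgroup_csv = """km_curve_id,km_image_id,group,sample_size,number_of_events
C1,IMG001,Treatment,150,NULL
C2,IMG001,Control,148,97
"""

def null_to_none(value):
    """Map missing markers (assumed "" or "NULL") to None."""
    return None if value in ("", "NULL") else value

def pair_arms(km_rows, subgroup_rows):
    """Attach the reference and experimental curve records to each KM image."""
    by_key = {(r["km_image_id"], r["group"]): r for r in subgroup_rows}
    paired = {}
    for km in km_rows:
        image_id = km["km_image_id"]
        paired[image_id] = {
            "hr": float(km["hr"]),
            "reference": by_key[(image_id, km["reference_group"])],
            "experimental": by_key[(image_id, km["experimental_group"])],
        }
    return paired

km_rows = list(csv.DictReader(io.StringIO(ra_km_csv)))
subgroup_rows = [
    {k: null_to_none(v) for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(ra_subgroup_csv))
]
paired = pair_arms(km_rows, subgroup_rows)
print(paired["IMG001"]["experimental"]["km_curve_id"])       # C1
print(paired["IMG001"]["experimental"]["number_of_events"])  # None
```

Keying the lookup on (km_image_id, group) relies on the documented exact match between the group column and the reference_group and experimental_group labels.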
Citation
Since the associated manuscript is currently unpublished, please cite this dataset using the Zenodo DOI provided on this page.
License
Creative Commons Attribution 4.0 International (CC BY 4.0)
Files
PublishedKM.zip
Additional details
Funding
- National Natural Science Foundation of China (award T2495250)
Dates
- Issued: 2026-02-04