A synthetic data generation pipeline to reproducibly mirror high-resolution multi-variable peptidomics and real-patient clinical data

Frantzi, Maria

doi:10.1101/2024.10.30.24316342

Published November 21, 2024 | Version v1

Dataset Open

Frantzi, Maria (Contact person)

Generating high-quality, real-world clinical and molecular datasets is challenging, costly and time-intensive. Such data should therefore be shared with the scientific community, but sharing carries the risk of privacy breaches. This risk limits the community's ability to freely share and access high-resolution, high-quality data, which are essential in the context of personalised medicine.

In this study, we present an algorithm based on Gaussian copulas to generate synthetic data that retain associations within high-dimensional (peptidomics) datasets. For this purpose, 3,881 datasets from 10 cohorts were employed, containing clinical, demographic and molecular (> 21,500 peptides) variables, as well as outcome data for individuals with a kidney or a heart failure event. High-dimensional copulas were developed to portray the distribution matrix between the clinical and peptidomics data in the dataset and, based on these distributions, a data matrix of 2,000 synthetic patients was generated. The synthetic data maintained the capacity to reproducibly correlate the peptidomics data with the clinical variables.
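As a rough illustration of the Gaussian copula idea described above, the following minimal sketch fits a copula to a handful of numeric columns and samples synthetic rows from it. It is not the authors' pipeline (that lives in the linked repository and may differ substantially). The rank-based normal scores, the Cholesky factorisation, the interpolated empirical quantiles and all function names are assumptions chosen for illustration, and the eGFR and peptide values are simulated toy data.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def to_normal_scores(column):
    """Map a column to standard-normal scores via its empirical ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = _STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation_matrix(columns):
    """Pearson correlation matrix of equally long columns."""
    n = len(columns[0])
    centered = [[x - sum(c) / n for x in c] for c in columns]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def empirical_quantile(sorted_column, u):
    """Inverse empirical CDF with linear interpolation, for u in [0, 1]."""
    pos = u * (len(sorted_column) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_column) - 1)
    frac = pos - lo
    return sorted_column[lo] * (1 - frac) + sorted_column[hi] * frac


def sample_gaussian_copula(columns, n_synthetic, seed=0):
    """Fit a Gaussian copula to the columns and draw synthetic rows."""
    rng = random.Random(seed)
    lower = cholesky(correlation_matrix([to_normal_scores(c) for c in columns]))
    sorted_columns = [sorted(c) for c in columns]
    d = len(columns)
    rows = []
    for _ in range(n_synthetic):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Correlate the independent normals, then map back to each marginal.
        correlated = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        rows.append([empirical_quantile(sc, _STD_NORMAL.cdf(v))
                     for sc, v in zip(sorted_columns, correlated)])
    return rows


# Toy example: one simulated clinical variable and one simulated peptide.
toy_rng = random.Random(1)
toy_egfr = [toy_rng.uniform(15, 120) for _ in range(200)]
toy_peptide = [0.05 * e + toy_rng.gauss(0, 1) for e in toy_egfr]
toy_synthetic = sample_gaussian_copula([toy_egfr, toy_peptide], n_synthetic=2000)
```

A pure-Python Cholesky factor is only practical for a few variables; for more than 21,500 peptides a numerical library and a regularised correlation estimate would presumably be needed.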

External validation was performed using independent multi-centric datasets (n = 2,964) of individuals with chronic kidney disease (CKD, defined as eGFR < 60 mL/min/1.73m²) or with normal kidney function (eGFR > 90 mL/min/1.73m²). The rho-values for the association of single peptides with eGFR in the synthetic data were significantly reproduced in the external validation datasets (rho = 0.569, p = 1.8e-218). Classifiers subsequently developed on the synthetic data matrices were highly predictive in external real-patient datasets (AUC values of 0.803 and 0.867 for heart failure (HF) and CKD, respectively), demonstrating the robustness of the developed method in the generation of synthetic patient data. The proposed pipeline represents a solution for sharing high-dimensional data while maintaining patient confidentiality.
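The comparison of per-peptide rho-values between synthetic and validation data can be sketched as follows. This assumes that "rho" denotes Spearman's rank correlation, which the record does not state explicitly. The function names and the dictionary layout (peptide identifier mapped to a list of intensities) are hypothetical and do not reflect the format of the deposited files.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def peptide_egfr_rhos(peptide_columns, egfr):
    """Spearman's rho of every peptide with eGFR (hypothetical layout)."""
    return {name: spearman(values, egfr)
            for name, values in peptide_columns.items()}


def rho_agreement(synthetic_rhos, validation_rhos):
    """Correlate the per-peptide rho-values of two datasets."""
    shared = sorted(set(synthetic_rhos) & set(validation_rhos))
    return spearman([synthetic_rhos[p] for p in shared],
                    [validation_rhos[p] for p in shared])
```

Calling `peptide_egfr_rhos` once on the synthetic matrix and once on the validation matrix, then passing both results to `rho_agreement`, yields a single agreement value of the kind reported above. The sketch does not compute a p-value.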

For this study 6,967 peptidomics mass spectrometry datasets were employed and are deposited here, including:

3,881 datasets that were employed for synthetic data generation

1) File name: hf_peptides_data.csv; size: 45.56 MB; Description: 472 datasets from patients developing a heart failure event

2) File name: ckd_peptides_data.csv; size: 10.98 MB; Description: 242 datasets from patients developing a kidney event

3) File name: no_event_peptides_fdata.csv; size: 194.70 MB; Description: 3,266 datasets from patients that did not develop any event

2,964 datasets that were used as external validation datasets (chronic kidney disease group)

*Study 1: PersTIgAN

4) File name: PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.7 MB; Description: Patients with CKD_Study1_export 1

5) File name: PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 2.6 MB; Description: Patients with CKD_Study1_export 2

*Study 2: CKD_BioBay

6) File name: CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 35.7 MB; Description: Patients with CKD_Study2_export 1

7) File name: CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 26.0 MB; Description: Patients with CKD_Study2_export 2

*Study 3: DC_Ren

8) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.96 MB; Description: Patients with CKD_Study3_export 1

9) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 38.13 MB; Description: Patients with CKD_Study3_export 2

10) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls; size: 36.86 MB; Description: Patients with CKD_Study3_export 3

11) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_4.xls; size: 38.39 MB; Description: Patients with CKD_Study3_export 4

12) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_5.xls; size: 38.12 MB; Description: Patients with CKD_Study3_export 5

13) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_6.xls; size: 36.73 MB; Description: Patients with CKD_Study3_export 6

14) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_7.xls; size: 2.15 MB; Description: Patients with CKD_Study3_export 7

*Non-CKD

15) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.72 MB; Description: datasets from patients without CKD_export 1

16) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 38.31 MB; Description: datasets from patients without CKD_export 2

17) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls; size: 36.95 MB; Description: datasets from patients without CKD_export 3

122 datasets that were used as external validation datasets (heart failure group)

18) File name: HF_external_case__MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls.txt; size: 3.13 MB; Description: datasets from patients that developed heart failure

19) File name: HF_external_Control_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls.txt; size: 3.94 MB; Description: datasets from patients that did not develop heart failure

Files (701.6 MB)

Name	MD5	Size
CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls	26be69d4da9a5a6aff3ec834367a8aad	35.7 MB
CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls	d3e79901b346b4fc875960fe34a45869	26.0 MB
ckd_peptides_data.csv	048a26b5026b40534b40325bc96ded99	11.0 MB
DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_4.xls	a36c208140c3203527de68e6ff1ed37f	38.4 MB
DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_5.xls	3d7298496717bf75865aab5f7ac09c61	38.1 MB
DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_6.xls	6f0b6f67356318d64e3d296c0cdf0d1d	36.7 MB
DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls	d01ecfe6bbe9b886a287ab482add3352	38.0 MB
DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls	c8688bf5063187c74daf7cfd79dd307d	38.1 MB
DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls	c2a8a34a4dcd84dab2ea9cfc9d46eedd	36.9 MB
DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_7.xls	06d75ecd7216286385d567a880d00bc1	2.2 MB
HF_external_case__MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls.txt	0b177236b13e825b9c43fbf4f5faa3af	3.1 MB
HF_external_Control_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls.txt	24140c8023d4fd1786cdfe2b63ce61c1	3.9 MB
hf_peptides_data.csv	b53c2d7a5c5848bef2daee875f8a3125	45.6 MB
no_event_peptides_fdata.csv	b142972ed632d973620e515991f1c00c	194.7 MB
NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls	e22f081997d66c7e677f10faf7656c3b	37.7 MB
NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls	56c0ef1c83a1a2cbeb3978d3620dd5c1	38.3 MB
NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls	61396f5e336fe3c6bb9abb69dd3ba0ef	37.0 MB
PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls	ab3979dc4bdde3517dc10781b7961d7b	37.7 MB
PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls	1e0a6aec2c269c11eca54633e92fc5b2	2.6 MB
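Because the file listing above gives an MD5 checksum for every file, downloads can be checked for corruption before analysis. The sketch below shows one way to do this with the Python standard library. The function names and the local `downloads` directory are hypothetical, and only the three CSV files are listed in the dictionary for brevity; the remaining entries can be added from the listing in the same way.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record (CSV files only).
EXPECTED_MD5 = {
    "ckd_peptides_data.csv": "048a26b5026b40534b40325bc96ded99",
    "hf_peptides_data.csv": "b53c2d7a5c5848bef2daee875f8a3125",
    "no_event_peptides_fdata.csv": "b142972ed632d973620e515991f1c00c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Return {file name: True/False}; missing files count as False."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results


if __name__ == "__main__":
    # "downloads" is a placeholder for wherever the files were saved.
    for name, ok in verify_downloads("downloads").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

Reading in chunks keeps memory use flat even for the largest file (194.7 MB).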

Additional details

Is supplement to: Dataset: medrxiv;DOI:10.1101/2024.10.30.24316342 (Other)

European Commission
DisCo-I (HORIZON-MSCA-2021-DN-ID) 101072828
European Commission
MULTIR (HORIZON-MISS-2023-CANCER-01-01) 101136926
European Cooperation in Science and Technology
PERMEDIK COST Action CA21165
European Commission
DC-ren (Horizon 2020 research and innovation) 848011
Federal Ministry of Education and Research
UPTAKE 01EK2105A, 01EK2105B, 01EK2105C
Federal Ministry of Education and Research
SIGNAL 01KU2307
Federal Ministry of Education and Research
ProSTRAT-AI 01DS23014
Federal Ministry for Economic Affairs and Climate Action
Accurate-CVD KK5560002AP3

Repository URL: https://github.com/Atomic-Intelligence/Peptide-synthesis.git
Programming language: Python

