Published November 21, 2024 | Version v1
Dataset Open

A synthetic data generation pipeline to reproducibly mirror high-resolution multi-variable peptidomics and real-patient clinical data

Description

Generating high-quality, real-world clinical and molecular datasets is challenging, costly and time-intensive. Such data should therefore be shared with the scientific community, but sharing carries the risk of privacy breaches. This risk hinders the community’s ability to freely share and access high-resolution, high-quality data, which are essential in the context of personalised medicine.

In this study, we present an algorithm based on Gaussian copulas that generates synthetic data while retaining the associations within high-dimensional (peptidomics) datasets. For this purpose, 3,881 datasets from 10 cohorts were employed, containing clinical, demographic and molecular (> 21,500 peptides) variables, as well as outcome data for individuals with a kidney or a heart failure event. High-dimensional copulas were developed to model the joint distribution of the clinical and peptidomics data in the dataset, and a data matrix of 2,000 synthetic patients was generated from these distributions. The synthetic data reproducibly retained the correlations between the peptidomics data and the clinical variables.
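The general idea of a Gaussian copula can be illustrated with a minimal sketch. This is not the code from the linked repository: the function names, the toy three-variable cohort and the use of empirical marginals are assumptions made for illustration only, and a matrix with more than 21,500 peptide columns would need a more careful treatment of the correlation matrix than shown here.

```python
from statistics import NormalDist

import numpy as np

STD_NORMAL = NormalDist()  # standard normal, used for CDF and inverse CDF


def to_normal_scores(column):
    """Map one variable to standard-normal scores through its empirical ranks."""
    ranks = np.argsort(np.argsort(column)) + 1   # ranks 1..n (ties broken by position)
    u = ranks / (len(column) + 1)                # pseudo-observations strictly in (0, 1)
    return np.array([STD_NORMAL.inv_cdf(p) for p in u])


def fit_gaussian_copula(real):
    """Estimate the copula correlation matrix and keep the empirical marginals."""
    scores = np.column_stack(
        [to_normal_scores(real[:, j]) for j in range(real.shape[1])]
    )
    corr = np.corrcoef(scores, rowvar=False)
    return corr, np.sort(real, axis=0)


def sample_gaussian_copula(corr, sorted_real, n_samples, rng):
    """Draw synthetic rows: correlated normals -> uniforms -> empirical quantiles."""
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n_samples)
    u = np.vectorize(STD_NORMAL.cdf)(z)
    columns = [np.quantile(sorted_real[:, j], u[:, j]) for j in range(len(corr))]
    return np.column_stack(columns)


# Toy "real" cohort (hypothetical): a clinical variable, a skewed peptide
# intensity that depends on it, and an unrelated variable.
rng = np.random.default_rng(0)
clinical = rng.normal(size=500)
peptide = np.exp(0.8 * clinical + 0.6 * rng.normal(size=500))
unrelated = rng.uniform(size=500)
real = np.column_stack([clinical, peptide, unrelated])

corr, sorted_real = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(corr, sorted_real, n_samples=2000, rng=rng)
print(synthetic.shape)
```

Because each synthetic column is drawn from the empirical quantiles of the corresponding real column, the marginal distributions are preserved, while the correlation matrix of the normal scores carries the dependence between variables.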

External validation was performed using independent multi-centric datasets (n = 2,964) of individuals with chronic kidney disease (CKD, defined as eGFR < 60 mL/min/1.73 m²) or with normal kidney function (eGFR > 90 mL/min/1.73 m²). The rho-values of single peptides with eGFR in the synthetic data were significantly correlated with those in the external validation datasets (rho = 0.569, p = 1.8e-218). Classifiers subsequently developed on the synthetic data matrices were highly predictive in external real-patient datasets (AUC values of 0.803 and 0.867 for heart failure (HF) and CKD, respectively), demonstrating the robustness of the developed method for generating synthetic patient data. The proposed pipeline represents a solution for sharing high-dimensional data while maintaining patient confidentiality.
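The comparison of per-peptide rho-values can be sketched as follows. This is a hedged illustration on simulated data, not the authors' analysis code: the helper names, the toy cohorts and the choice of Spearman's rank correlation computed without tie correction are all assumptions.

```python
import numpy as np


def ranks(values):
    """Ordinal ranks 0..n-1 (no tie averaging; adequate for continuous toy data)."""
    return np.argsort(np.argsort(values)).astype(float)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])


def peptide_rhos(peptides, egfr):
    """One rho per peptide column: association of that peptide with eGFR."""
    return np.array([spearman(peptides[:, j], egfr) for j in range(peptides.shape[1])])


def make_toy_cohort(n_patients, slopes, rng):
    """Hypothetical cohort: each peptide depends on eGFR with its own slope plus noise."""
    egfr = rng.uniform(15, 120, size=n_patients)
    scaled = (egfr - egfr.mean()) / egfr.std()
    peptides = np.outer(scaled, slopes) + rng.normal(size=(n_patients, len(slopes)))
    return peptides, egfr


rng = np.random.default_rng(1)
slopes = np.linspace(-1.0, 1.0, 50)            # 50 toy peptides
synthetic_peptides, synthetic_egfr = make_toy_cohort(300, slopes, rng)
external_peptides, external_egfr = make_toy_cohort(400, slopes, rng)

rho_synthetic = peptide_rhos(synthetic_peptides, synthetic_egfr)
rho_external = peptide_rhos(external_peptides, external_egfr)

# Agreement between the two sets of per-peptide rho-values.
agreement = spearman(rho_synthetic, rho_external)
print(round(agreement, 3))
```

A high agreement value means that peptides associated with eGFR in the synthetic data show a similar association, in direction and strength, in the independent cohort.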

 

For this study 6,967 peptidomics mass spectrometry datasets were employed and are deposited here, including: 

  • 3,881 datasets that were employed for synthetic data generation

1) File name: hf_peptides_data.csv; size: 45.56 MB; Description: 472 datasets from patients developing a heart failure event

2) File name: ckd_peptides_data.csv; size: 10.98 MB; Description: 242 datasets from patients developing a kidney event

3) File name: no_event_peptides_fdata.csv; size: 194.70 MB; Description: 3,266 datasets from patients that did not develop any event

 

  • 2,964 datasets that were used as external validation datasets (chronic kidney disease group)

*Study 1: PersTIgAN

4) File name: PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.7 MB; Description: Patients with CKD_Study1_export 1

5) File name: PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 2.6 MB; Description: Patients with CKD_Study1_export 2

 

*Study 2: CKD_Biobay

6) File name: CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 35.7 MB; Description: Patients with CKD_Study2_export 1

7) File name: CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 26.0 MB; Description: Patients with CKD_Study2_export 2

 

*Study 3: DC_Ren

8) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.96 MB; Description: Patients with CKD_Study3_export 1

9) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 38.13 MB; Description: Patients with CKD_Study3_export 2

10) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls; size: 36.86 MB; Description: Patients with CKD_Study3_export 3

11) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_4.xls; size: 38.39 MB; Description: Patients with CKD_Study3_export 4

12) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_5.xls; size: 38.12 MB; Description: Patients with CKD_Study3_export 5

13) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_6.xls; size: 36.73 MB; Description: Patients with CKD_Study3_export 6

14) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_7.xls; size: 2.15 MB; Description: Patients with CKD_Study3_export 7

 

*Non-CKD 

15) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.72 MB; Description: datasets from patients without CKD_export 1

16) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 38.31 MB; Description: datasets from patients without CKD_export 2

17) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls; size: 36.95 MB; Description: datasets from patients without CKD_export 3

 

  • 122 datasets that were used as external validation datasets (heart failure group) 

18) File name: HF_external_case__MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls; size: 3.13 MB; Description: datasets from patients that developed heart failure

19) File name: HF_external_Control_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls; size: 3.94 MB; Description: datasets from patients that did not develop heart failure

Files

Files (701.6 MB)

Checksum Size
md5:26be69d4da9a5a6aff3ec834367a8aad 35.7 MB
md5:d3e79901b346b4fc875960fe34a45869 26.0 MB
md5:048a26b5026b40534b40325bc96ded99 11.0 MB
md5:a36c208140c3203527de68e6ff1ed37f 38.4 MB
md5:3d7298496717bf75865aab5f7ac09c61 38.1 MB
md5:6f0b6f67356318d64e3d296c0cdf0d1d 36.7 MB
md5:d01ecfe6bbe9b886a287ab482add3352 38.0 MB
md5:c8688bf5063187c74daf7cfd79dd307d 38.1 MB
md5:c2a8a34a4dcd84dab2ea9cfc9d46eedd 36.9 MB
md5:06d75ecd7216286385d567a880d00bc1 2.2 MB
md5:0b177236b13e825b9c43fbf4f5faa3af 3.1 MB
md5:24140c8023d4fd1786cdfe2b63ce61c1 3.9 MB
md5:b53c2d7a5c5848bef2daee875f8a3125 45.6 MB
md5:b142972ed632d973620e515991f1c00c 194.7 MB
md5:e22f081997d66c7e677f10faf7656c3b 37.7 MB
md5:56c0ef1c83a1a2cbeb3978d3620dd5c1 38.3 MB
md5:61396f5e336fe3c6bb9abb69dd3ba0ef 37.0 MB
md5:ab3979dc4bdde3517dc10781b7961d7b 37.7 MB
md5:1e0a6aec2c269c11eca54633e92fc5b2 2.6 MB
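
A downloaded file can be checked against its listed md5 checksum with a short script. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and since this listing does not pair checksums with file names, the expected value for each file has to be taken from the record page itself. The demonstration uses a small temporary file rather than one of the deposited files.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Demonstration on a throwaway file (not part of the deposited data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bin"
    demo_path.write_bytes(b"hello")
    demo_ok = matches_checksum(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
print(demo_ok)
```

Reading in chunks keeps memory use low for the larger files in this record, such as the 194.7 MB CSV.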

Additional details

Related works

Is supplement to
Dataset: medrxiv;DOI:10.1101/2024.10.30.24316342 (Other)

Funding

European Commission
DisCo-I (HORIZON-MSCA-2021-DN-ID) 101072828
European Commission
MULTIR (HORIZON-MISS-2023-CANCER-01-01) 101136926
European Cooperation in Science and Technology
PERMEDIK COST Action CA21165
European Commission
DC-ren (Horizon 2020 research and innovation) 848011
Federal Ministry of Education and Research
UPTAKE 01EK2105A, 01EK2105B, 01EK2105C
Federal Ministry of Education and Research
SIGNAL 01KU2307
Federal Ministry of Education and Research
ProSTRAT-AI 01DS23014
Federal Ministry for Economic Affairs and Climate Action
Accurate-CVD KK5560002AP3

Software

Repository URL
https://github.com/Atomic-Intelligence/Peptide-synthesis.git
Programming language
Python