A synthetic data generation pipeline to reproducibly mirror high-resolution multi-variable peptidomics and real-patient clinical data
Creators
Description
Generating high-quality, real-world clinical and molecular datasets is challenging, costly and time-intensive. Such data should therefore be shared with the scientific community, but sharing carries the risk of privacy breaches. This risk limits the community’s ability to freely share and access the high-resolution, high-quality data that are essential, especially in the context of personalised medicine.
In this study, we present an algorithm based on Gaussian copulas to generate synthetic data that retain the associations within high-dimensional (peptidomics) datasets. For this purpose, 3,881 datasets from 10 cohorts were used, containing clinical, demographic and molecular (> 21,500 peptides) variables, as well as outcome data for individuals with a kidney or a heart failure event. High-dimensional copulas were developed to capture the distributions of, and the dependence between, the clinical and peptidomics variables in the dataset, and a data matrix of 2,000 synthetic patients was generated from these distributions. The synthetic data reproduced the correlations between the peptidomics data and the clinical variables.
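The general idea of a Gaussian copula can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation (which lives in the linked repository): the function names, the two toy variables and the use of empirical quantiles for the marginals are all assumptions made for this sketch, and the input values are simulated rather than taken from the deposited files.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()
_to_normal = np.vectorize(_STD_NORMAL.inv_cdf)   # uniform (0, 1) -> normal score
_to_uniform = np.vectorize(_STD_NORMAL.cdf)      # normal score -> uniform (0, 1)


def fit_gaussian_copula(real):
    """Estimate the copula correlation matrix from a (samples x variables) array."""
    n = real.shape[0]
    # Rank each column (ties are broken arbitrarily) and scale to (0, 1);
    # dividing by n + 1 keeps the values strictly away from 0 and 1.
    ranks = real.argsort(axis=0).argsort(axis=0) + 1
    uniform = ranks / (n + 1)
    # Correlate the normal scores; columns are variables.
    return np.corrcoef(_to_normal(uniform), rowvar=False)


def sample_gaussian_copula(real, corr, n_synthetic, seed=0):
    """Draw synthetic rows whose marginals follow the empirical quantiles of `real`."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_synthetic)
    u = _to_uniform(z)
    # Push each uniform column through that variable's empirical quantile function.
    return np.column_stack(
        [np.quantile(real[:, j], u[:, j]) for j in range(real.shape[1])]
    )


# Toy input (simulated, hypothetical): an eGFR-like variable and one
# peptide-like intensity that depends on it.
rng = np.random.default_rng(42)
egfr = rng.normal(70, 20, size=500)
peptide = 0.05 * egfr + rng.normal(0, 1, size=500)
real = np.column_stack([egfr, peptide])

corr = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(real, corr, n_synthetic=2000)
```

Because the marginals are rebuilt from empirical quantiles, every synthetic value in this sketch stays within the observed range of the real variable. That is a design choice of the example, not a statement about the published pipeline.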
External validation was performed using independent multi-centric datasets (n = 2,964) of individuals with chronic kidney disease (CKD, defined as eGFR < 60 mL/min/1.73 m²) or with normal kidney function (eGFR > 90 mL/min/1.73 m²). The rho-values describing the association of single peptides with eGFR correlated significantly between the synthetic and the external validation datasets (rho = 0.569, p = 1.8e-218). Classifiers subsequently developed on the synthetic data matrices were highly predictive in external real-patient datasets (AUC values of 0.803 and 0.867 for heart failure (HF) and CKD, respectively), demonstrating the robustness of the method for generating synthetic patient data. The proposed pipeline represents a solution for sharing high-dimensional data while maintaining patient confidentiality.
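The comparison of per-peptide rho-values described above can be illustrated as follows. This is a hedged sketch on simulated data: the helper names, the 20 hypothetical peptides and their effect sizes are assumptions, and the rank-based Spearman helper ignores ties. For real data, `scipy.stats.spearmanr` handles ties and also returns a p-value.

```python
import numpy as np


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


def peptide_egfr_rhos(peptides, egfr):
    """One rho per peptide column: its association with eGFR."""
    return np.array(
        [spearman_rho(peptides[:, j], egfr) for j in range(peptides.shape[1])]
    )


rng = np.random.default_rng(0)
slopes = np.linspace(-1.0, 1.0, 20)  # hypothetical peptide-eGFR effect sizes


def simulate(n):
    """Simulate n individuals sharing the same underlying peptide-eGFR effects."""
    egfr = rng.normal(70, 20, size=n)
    noise = rng.normal(0, 1, size=(n, slopes.size))
    peptides = egfr[:, None] * slopes / 20 + noise
    return peptides, egfr


# Stand-ins for the synthetic matrix and an external validation cohort.
rhos_synthetic = peptide_egfr_rhos(*simulate(400))
rhos_external = peptide_egfr_rhos(*simulate(300))

# Correlation of the two rho vectors: how well the peptide-eGFR
# associations agree between the two datasets.
concordance = spearman_rho(rhos_synthetic, rhos_external)
```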
For this study, 6,967 peptidomics mass spectrometry datasets were used. They are deposited here and comprise:
- 3,881 datasets that were employed for synthetic data generation
1) File name: hf_peptides_data.csv; size: 45.56 MB; Description: 472 datasets from patients developing a heart failure event
2) File name: ckd_peptides_data.csv; size: 10.98 MB; Description: 242 datasets from patients developing a kidney event
3) File name: no_event_peptides_fdata.csv; size: 194.70 MB; Description: 3,266 datasets from patients that did not develop any event
- 2,964 datasets that were used as external validation datasets (chronic kidney disease group)
*Study 1: PersTIgAN
4) File name: PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.7 MB; Description: Patients with CKD_Study1_export 1
5) File name: PersTIgAN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 2.6 MB; Description: Patients with CKD_Study1_export 2
*Study 2: CKD_BioBay
6) File name: CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 35.7 MB; Description: Patients with CKD_Study2_export 1
7) File name: CKD_BioBay_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 26.0 MB; Description: Patients with CKD_Study2_export 2
*Study 3: DC_Ren
8) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.96 MB; Description: Patients with CKD_Study3_export 1
9) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 38.13 MB; Description: Patients with CKD_Study3_export 2
10) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls; size: 36.86 MB; Description: Patients with CKD_Study3_export 3
11) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_4.xls; size: 38.39 MB; Description: Patients with CKD_Study3_export 4
12) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_5.xls; size: 38.12 MB; Description: Patients with CKD_Study3_export 5
13) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_6.xls; size: 36.73 MB; Description: Patients with CKD_Study3_export 6
14) File name: DCREN_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_1_Pivot_Blatt_7.xls; size: 2.15 MB; Description: Patients with CKD_Study3_export 7
*Non-CKD
15) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_1.xls; size: 37.72 MB; Description: datasets from patients without CKD_export 1
16) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_2.xls; size: 38.31 MB; Description: datasets from patients without CKD_export 2
17) File name: NonCKD_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot_Blatt_3.xls; size: 36.95 MB; Description: datasets from patients without CKD_export 3
- 122 datasets that were used as external validation datasets (heart failure group)
18) File name: HF_external_case__MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls; size: 3.13 MB; Description: datasets from patients who developed heart failure
19) File name: HF_external_Control_MosaID_1_7_5_MFinder_vs_MV_HybridSolution_v4_ML1_Pivot.xls; size: 3.94 MB; Description: datasets from patients who did not develop heart failure
Files (701.6 MB)
MD5 checksum | Size |
---|---|
md5:26be69d4da9a5a6aff3ec834367a8aad | 35.7 MB |
md5:d3e79901b346b4fc875960fe34a45869 | 26.0 MB |
md5:048a26b5026b40534b40325bc96ded99 | 11.0 MB |
md5:a36c208140c3203527de68e6ff1ed37f | 38.4 MB |
md5:3d7298496717bf75865aab5f7ac09c61 | 38.1 MB |
md5:6f0b6f67356318d64e3d296c0cdf0d1d | 36.7 MB |
md5:d01ecfe6bbe9b886a287ab482add3352 | 38.0 MB |
md5:c8688bf5063187c74daf7cfd79dd307d | 38.1 MB |
md5:c2a8a34a4dcd84dab2ea9cfc9d46eedd | 36.9 MB |
md5:06d75ecd7216286385d567a880d00bc1 | 2.2 MB |
md5:0b177236b13e825b9c43fbf4f5faa3af | 3.1 MB |
md5:24140c8023d4fd1786cdfe2b63ce61c1 | 3.9 MB |
md5:b53c2d7a5c5848bef2daee875f8a3125 | 45.6 MB |
md5:b142972ed632d973620e515991f1c00c | 194.7 MB |
md5:e22f081997d66c7e677f10faf7656c3b | 37.7 MB |
md5:56c0ef1c83a1a2cbeb3978d3620dd5c1 | 38.3 MB |
md5:61396f5e336fe3c6bb9abb69dd3ba0ef | 37.0 MB |
md5:ab3979dc4bdde3517dc10781b7961d7b | 37.7 MB |
md5:1e0a6aec2c269c11eca54633e92fc5b2 | 2.6 MB |
Additional details
Related works
- Is supplement to
- Dataset: medrxiv;DOI:10.1101/2024.10.30.24316342 (Other)
Funding
- European Commission
- DisCo-I (HORIZON-MSCA-2021-DN-ID) 101072828
- European Commission
- MULTIR (HORIZON-MISS-2023-CANCER-01-01) 101136926
- European Cooperation in Science and Technology
- PERMEDIK COST Action CA21165
- European Commission
- DC-ren (Horizon 2020 research and innovation) 848011
- Federal Ministry of Education and Research
- UPTAKE 01EK2105A, 01EK2105B, 01EK2105C
- Federal Ministry of Education and Research
- SIGNAL 01KU2307
- Federal Ministry of Education and Research
- ProSTRAT-AI 01DS23014
- Federal Ministry for Economic Affairs and Climate Action
- Accurate-CVD KK5560002AP3
Software
- Repository URL
- https://github.com/Atomic-Intelligence/Peptide-synthesis.git
- Programming language
- Python