A unified benchmark of synthetic data generation for clinical transcriptomic cancer cohorts
Authors/Creators
- 1. Université Grenoble Alpes, CEA, INSERM, IRIG, UA13 BGE, Grenoble, France
- 2. Pharmacology & Toxicology, Inserm, U1248, University of Limoges, CHU Limoges, Limoges, France
- 3. Interdisciplinary Research Institute of Grenoble (CEA)
Description
Balancing biological utility against patient privacy remains a key challenge for secure data sharing when clinical transcriptomic datasets are used for artificial intelligence in precision oncology. Here, we introduce the first benchmarking study tailored to high-dimensional clinical transcriptomic cancer data, comparing synthetic data generation methods across three clinical cancer trials. Our framework, SynOmicBench, combines standardized preprocessing with multidimensional evaluation, prioritizing downstream biological validation alongside statistical fidelity and attack-based privacy assessment. No single method dominated all dimensions: Gaussian Copula achieved the most balanced performance, followed by Avatar. The results also demonstrate that metric-based similarity alone is insufficient to ensure preservation of higher-order molecular dependencies. Synthetic data consistently reproduced the direction of biomedical signals, but with attenuated effect sizes and variability across replicates, which supports their use for hypothesis generation when multi-seed synthesis is adopted. Collectively, this framework provides a reproducible decision-support tool for method selection and promotes biologically informed, privacy-aware adoption of synthetic data in precision oncology.
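The description recommends multi-seed synthesis because synthetic replicates tend to keep the direction of a signal while shrinking its size and varying between runs. The sketch below shows one possible way to summarize that behaviour for a single feature. It is not taken from the SynOmicBench code: the function name `summarize_across_seeds`, the returned keys, and the toy numbers are all hypothetical, and the effect estimates (for example log fold changes) are assumed to have been computed beforehand.

```python
import random
import statistics


def summarize_across_seeds(real_effect, synthetic_effects):
    """Summarize one feature's effect estimates from several synthetic replicates.

    real_effect: effect estimated on the real cohort (assumed non-zero).
    synthetic_effects: one estimate per synthesis seed (at least two values).
    """
    # A replicate is concordant when its effect has the same sign as the real one.
    same_sign = [effect * real_effect > 0 for effect in synthetic_effects]
    mean_effect = statistics.mean(synthetic_effects)
    return {
        # Fraction of replicates that reproduce the direction of the real effect.
        "sign_concordance": sum(same_sign) / len(synthetic_effects),
        "mean_effect": mean_effect,
        # Spread of the estimate between seeds (sample standard deviation).
        "sd_across_seeds": statistics.stdev(synthetic_effects),
        # A ratio below 1 means the synthetic effect is smaller than the real one.
        "attenuation_ratio": mean_effect / real_effect,
    }


# Toy illustration only: simulated values, not results from the study.
rng = random.Random(0)
toy_real_effect = 1.0
toy_synthetic = [0.6 * toy_real_effect + rng.gauss(0, 0.1) for _ in range(5)]
print(summarize_across_seeds(toy_real_effect, toy_synthetic))
```

Reporting the sign concordance together with the spread across seeds makes it visible when a finding depends on a single lucky replicate.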
Files
| Name | Size | Checksum |
|---|---|---|
| synthetic_datasets.zip | 3.6 GB | md5:10f57d6cd78d152521b087426a1e73c8 |
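Because the archive is large, it can be worth checking a download against the MD5 checksum listed for synthetic_datasets.zip. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `archive_is_intact` are illustrative, and the local file path is an assumption about where the archive was saved.

```python
import hashlib

# Checksum as listed on this record for synthetic_datasets.zip.
EXPECTED_MD5 = "10f57d6cd78d152521b087426a1e73c8"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the 3.6 GB archive is never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Compare the file's MD5 digest with the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `archive_is_intact("synthetic_datasets.zip")` from the download directory should return `True` for an uncorrupted copy. MD5 detects accidental corruption during transfer, not deliberate tampering.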
Additional details
Funding
- Agence Nationale de la Recherche
- DIGPHAT - Multi-scale and longitudinal data modeling in pharmacology: toward digital pharmacological twins ANR-22-PESN-0017
- European Commission
- KATY - Knowledge At the Tip of Your fingers: Clinical Knowledge for Humanity 101017453
- European Commission
- CANVAS - Enhancing Cancer Vaccine Science for New Therapy Pathways 101079510
Software
- Repository URL
- https://trinhthechuong.github.io/SynOmicsBench/
- Development Status
- Active
References
- Woillard, J.-B., Benoist, C., Destere, A. et al. To be or not to be, when synthetic data meet clinical pharmacology: A focused study on pharmacogenetics. CPT Pharmacometrics Syst. Pharmacol. 14, 82-94 (2025). https://doi.org/10.1002/psp4.13240
- Guillaudeux, M., Rousseau, O., Petot, J. et al. Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis. npj Digit. Med. 6, 37 (2023). https://doi.org/10.1038/s41746-023-00771-5
- Kaabachi, B., Despraz, J., Meurers, T. et al. A scoping review of privacy and utility metrics in medical synthetic data. npj Digit. Med. 8, 60 (2025). https://doi.org/10.1038/s41746-024-01359-3
- Giomi, M., Boenisch, F., Wehmeyer, C. & Tasnádi, B. A Unified Framework for Quantifying Privacy Risk in Synthetic Data. arXiv (2022). https://doi.org/10.48550/arXiv.2211.10459