Published November 27, 2025 | Version v1
Dataset Open

Explainable machine learning reveals drivers of amplifications and deletions across cancer genomes

  • 1. NYU Langone Health
  • 2. European Institute of Oncology

Description

Code is deposited on GitHub.

This Zenodo record contains:

 - Feature matrix for nine CNA-length-location classes
 - Annotation results for four CNA-length classes
 - Raw and intermediate files for scripts

Abstract

Amplifications and deletions of genomic regions are pervasive features of cancer genomes, yet it remains largely unclear which of these focal copy number alterations (CNAs) or chromosome- and arm-level aneuploidies act as drivers of carcinogenesis and which merely reflect underlying chromosomal instability. In this study, we develop an explainable machine learning framework that predicts amplification and deletion frequencies across 11 cancer genomes by integrating genomic-structural features that shape the probability of CNA occurrence with gene-level features indicative of selection. The models achieve high performance across focal, mid-length, arm-level and chromosome-level events, revealing scale-, chromosome- and tumor-dependent selective and mechanistic forces. Local architectural features such as proximity to centromeres, telomeres, and fragile sites mainly drive focal CNAs, whereas mid-length and large-scale CNAs reflect a mixture of structural constraints, dosage sensitivity, and gene-specific selection linked to oncogenes, tumor suppressors, and essential genes. Using SHAP-based interpretability, we generate a genome-wide map that distinguishes regions whose copy-number states are best explained by selective pressure from those arising primarily through structural susceptibility, ultimately providing a web-based annotation browser, named PENNE (Prediction & Explanation of Non-neutral copy Number Events), to investigate the CNA landscape at different scales. Finally, longitudinal single-cell DNA-sequencing of Reversine-induced chromosomal-instability experiments validates the model: early aneuploidies are stochastic, but over time, chromosome arms predicted to confer selective advantage become preferentially retained. Together, our findings establish a framework for interpreting the CNA and aneuploidy landscape of cancer and for systematically uncovering their likely functional drivers.
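For readers unfamiliar with the attribution idea behind SHAP, the following toy sketch may help. It computes exact Shapley values for a two-feature model by averaging each feature's marginal contribution over all feature orderings. It is not the pipeline used in this study: the model, its coefficients, and the feature names `fragile_site_proximity` and `oncogene_density` are hypothetical and chosen only to mirror the structural versus selection distinction described above.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance relative to a baseline.

    Each feature's value is the average change in the prediction when that
    feature is switched from its baseline value to its observed value,
    taken over every possible ordering of the features.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the baseline input
        previous = predict(current)
        for name in order:
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    # Hypothetical model: one structural feature, one selection feature,
    # plus an interaction term. Coefficients are made up for illustration.
    s = x["fragile_site_proximity"]
    g = x["oncogene_density"]
    return 0.5 * s + 0.3 * g + 0.2 * s * g


instance = {"fragile_site_proximity": 1.0, "oncogene_density": 1.0}
baseline = {"fragile_site_proximity": 0.0, "oncogene_density": 0.0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

In this toy case the attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features. Exact enumeration is only practical for a handful of features; SHAP implementations rely on model-specific or sampling-based approximations.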

Files

annotation-matrix.zip

Files (13.3 GB)

Checksum Size
md5:ced43a0e8ec2e56b7a717469f473e414 91.4 MB
md5:aa8f531a6cb94fc6a8191fdb1c14105b 9.9 GB
md5:3a54193d05c23acafaf2a438ad2d1b34 60.9 MB
md5:3e6b84962a8a33c1af802a0cf4573fc7 3.3 GB
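
Because the archives are large, it can be worth checking a download against its listed MD5 checksum before unpacking. The sketch below reads a file in chunks so that a multi-gigabyte archive never has to fit in memory. The helper names `md5_of_file` and `matches_checksum` are illustrative, and the path in the usage note is a placeholder, since this listing does not show which file name belongs to which checksum.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a value such as 'md5:ced4...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum("path/to/downloaded_archive.zip", "md5:...")` should return `True` when the placeholder path and checksum are replaced with a downloaded file and its matching entry from the list above. MD5 is adequate for detecting corrupted or truncated downloads, which is its role here; it is not a safeguard against deliberate tampering.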