TESSERA features for the TCGA Pan-Cancer Atlas
Description
Per-variant, per-segment, and per-sample embeddings produced by TESSERA, a self-supervised foundation model for the cancer genome jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. This deposit accompanies the manuscript “A Foundation Model for the Cancer Genome” (Sidhom et al.) and provides a directly reusable representation of the TCGA Pan-Cancer Atlas, so that downstream analyses can build on TESSERA without re-running pretraining or inference.
The features come from the canonical joint SNV+CNA InfoNCE-aligned model used throughout the manuscript’s downstream analyses (variant interpretation, tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and predictive-biomarker discovery).
Contents (HDF5 format):
- snv_per_variant.h5 — per-variant SNV embeddings (1,921,403 variants × 1,169 dimensions) with full variant metadata (gene, locus, alleles, variant class, VAF, cohort).
- cna_per_segment.h5 — per-segment CNA embeddings (1,823,050 segments × 688 dimensions) with full segment metadata (locus, segment mean, integer copy-number states, LOH).
- per_sample_aggregated.h5 — per-sample mean and max pools (RobustScaler-normalized per modality) for 10,040 SNV-profiled and 10,742 CNA-profiled samples (9,694 with both modalities), including the scaler parameters and per-sample token counts.
- README.md — complete file format, HDF5 schema, quick-start Python code, and citation.
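The shapes and sample counts listed above imply some useful back-of-the-envelope numbers before downloading. The sketch below estimates the in-memory size of the two dense embedding matrices and derives the per-modality sample overlap. It assumes 4 bytes per value (float32), which is not stated in this description; check the actual dtype in the files or the README.md.

```python
# Shapes and sample counts as stated in the contents list
SNV_SHAPE = (1_921_403, 1_169)   # variants x embedding dimensions
CNA_SHAPE = (1_823_050, 688)     # segments x embedding dimensions
N_SNV, N_CNA, N_BOTH = 10_040, 10_742, 9_694

BYTES_PER_VALUE = 4  # assumption: float32 storage; verify against the real files


def dense_gib(shape, bytes_per_value=BYTES_PER_VALUE):
    """Size in GiB of a dense 2-D array with the given shape."""
    rows, cols = shape
    return rows * cols * bytes_per_value / 2**30


snv_gib = dense_gib(SNV_SHAPE)
cna_gib = dense_gib(CNA_SHAPE)

# Inclusion-exclusion on the stated sample counts
snv_only = N_SNV - N_BOTH
cna_only = N_CNA - N_BOTH
any_modality = N_SNV + N_CNA - N_BOTH

print(f"SNV matrix: {snv_gib:.2f} GiB, CNA matrix: {cna_gib:.2f} GiB")
print(f"SNV only: {snv_only}, CNA only: {cna_only}, either modality: {any_modality}")
```

Under the float32 assumption, the SNV matrix is about 8.4 GiB and the CNA matrix about 4.7 GiB when fully loaded, so reading row slices rather than whole arrays may be preferable on smaller machines.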
Source code: https://github.com/JW-Sidhom-Lab/tessera
Pretrained model weights: https://huggingface.co/JW-Sidhom-Lab/tessera-foundation
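As a minimal sketch of how the HDF5 files might be explored, the following uses h5py to list every dataset with its shape and dtype without loading array data, and then reads a small row slice. The dataset names used here (`embeddings`, `metadata/gene`) are hypothetical placeholders written to a tiny synthetic file; the real schema is documented in README.md and may differ.

```python
import os
import tempfile

import h5py
import numpy as np


def describe_h5(path):
    """Return {dataset_path: (shape, dtype)} without reading any array data."""
    summary = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset in the file
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return summary


def read_rows(path, dataset, start, stop):
    """Read only rows [start, stop) of a dataset from disk."""
    with h5py.File(path, "r") as f:
        return f[dataset][start:stop]


# Tiny synthetic stand-in for a deposit file.
# "embeddings" and "metadata/gene" are hypothetical names, not the real schema.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("embeddings", data=np.zeros((5, 8), dtype="float32"))
    f.create_dataset("metadata/gene", data=np.array([b"TP53"] * 5))

schema = describe_h5(demo_path)
chunk = read_rows(demo_path, "embeddings", 1, 3)

for name, (shape, dtype) in sorted(schema.items()):
    print(name, shape, dtype)
```

Pointing `describe_h5` at a downloaded file such as snv_per_variant.h5 should reveal the actual dataset paths, which can then be passed to `read_rows` to pull a manageable block of embeddings.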
Additional details
Software
- Repository URL: https://github.com/JW-Sidhom-Lab/tessera
- Programming language: Python
- Development Status: Active