Published May 28, 2026 | Version v1
Dataset Open

TESSERA features for the TCGA Pan-Cancer Atlas

  • 1. Weill Cornell Medicine
  • 2. Johns Hopkins University
  • 3. Weill Cornell Medical College
  • 4. Cornell University

Description

Per-variant, per-segment, and per-sample embeddings produced by TESSERA, a self-supervised foundation model for the cancer genome jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. This deposit accompanies the manuscript “A Foundation Model for the Cancer Genome” (Sidhom et al.) and provides a directly reusable representation of the TCGA Pan-Cancer Atlas, so that downstream analyses can build on TESSERA without re-running pretraining or inference.

The features come from the canonical joint SNV+CNA InfoNCE-aligned model used throughout the manuscript’s downstream analyses (variant interpretation, tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and predictive-biomarker discovery).

Contents (HDF5 format):

  • snv_per_variant.h5 — per-variant SNV embeddings (1,921,403 variants × 1,169 dimensions) with full variant metadata (gene, locus, alleles, variant class, VAF, cohort).
  • cna_per_segment.h5 — per-segment CNA embeddings (1,823,050 segments × 688 dimensions) with full segment metadata (locus, segment mean, integer copy-number states, LOH).
  • per_sample_aggregated.h5 — per-sample mean and max pools (RobustScaler-normalized per modality) for 10,040 SNV-profiled and 10,742 CNA-profiled samples (9,694 with both modalities), including the scaler parameters and per-sample token counts.
  • README.md — complete file format, HDF5 schema, quick-start Python code, and citation.
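
Before downloading or loading these files, it can help to estimate how much memory each embedding matrix needs and to inspect the actual HDF5 layout. The sketch below is an illustration under stated assumptions, not the quick-start code from README.md: the 4-byte item size assumes float32 storage (the record does not state the dtype), the helper names `dense_nbytes` and `list_datasets` are hypothetical, and the third-party `h5py` package is assumed to be installed for the inspection step. No dataset names inside the files are assumed; the walker simply reports whatever it finds.

```python
import os
from typing import List, Tuple


def dense_nbytes(rows: int, dims: int, itemsize: int = 4) -> int:
    """Bytes needed to hold a dense rows x dims matrix in memory.

    itemsize=4 ASSUMES float32; the stored dtype is not stated in the record.
    """
    return rows * dims * itemsize


def list_datasets(path: str) -> List[Tuple[str, tuple, str]]:
    """Return (name, shape, dtype) for every dataset in an HDF5 file."""
    import h5py  # third-party; imported lazily so dense_nbytes works without it

    found: List[Tuple[str, tuple, str]] = []

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return found


if __name__ == "__main__":
    # Matrix shapes taken from the Contents list above.
    shapes = {
        "snv_per_variant.h5": (1_921_403, 1_169),
        "cna_per_segment.h5": (1_823_050, 688),
    }
    for filename, (rows, dims) in shapes.items():
        print(f"{filename}: ~{dense_nbytes(rows, dims) / 1e9:.2f} GB if float32")
        if os.path.exists(filename):  # inspect only files already downloaded
            for name, shape, dtype in list_datasets(filename):
                print(f"  {name}  shape={shape}  dtype={dtype}")
```

If the estimate exceeds available RAM, slicing an open `h5py` dataset (for example, reading a block of rows at a time) avoids loading the full matrix; the authoritative dataset names and dtypes are in README.md.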

Source code: https://github.com/JW-Sidhom-Lab/tessera
Pretrained model weights: https://huggingface.co/JW-Sidhom-Lab/tessera-foundation

Files (12.7 GB)

Size      Checksum
4.4 GB    md5:fc593f8fc675242b7e2d4d4655268c2b
129.1 MB  md5:58d3333e2ea40aeeabb2e3ac2807a730
9.7 kB    md5:0110e7ce96664b645496e75656a06353
8.2 GB    md5:a7cd8157f74e726aab858c2a2cd382d3
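
Because the largest files are several gigabytes, checking a download against its listed MD5 checksum is a cheap way to catch truncated or corrupted transfers. The following is a minimal sketch using only the Python standard library; the function names `file_md5` and `matches_listing` are hypothetical, and which checksum belongs to which file must be read off the record page by the user.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or as bare hex."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return file_md5(path) == expected
```

A call such as `matches_listing(local_path, "md5:<checksum from the listing>")` returns `True` only when the local file is byte-identical to the deposited one. Reading in 1 MiB chunks keeps memory use flat regardless of file size.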

Additional details

Software

Repository URL: https://github.com/JW-Sidhom-Lab/tessera
Programming language: Python
Development status: Active