TESSERA features for the TCGA Pan-Cancer Atlas

Sidhom, John-William; Baras, Alexander; Elemento, Olivier; Shah, Manish

doi:10.5281/zenodo.20419467

Published May 28, 2026 | Version v1

Dataset Open

1. Weill Cornell Medicine
2. Johns Hopkins University
3. Weill Cornell Medical College
4. Cornell University

Per-variant, per-segment, and per-sample embeddings produced by TESSERA, a self-supervised foundation model for the cancer genome jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. This deposit accompanies the manuscript “A Foundation Model for the Cancer Genome” (Sidhom et al.) and provides a directly reusable representation of the TCGA Pan-Cancer Atlas, so that downstream analyses can build on TESSERA without re-running pretraining or inference.

The features come from the canonical joint SNV+CNA InfoNCE-aligned model used throughout the manuscript’s downstream analyses (variant interpretation, tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and predictive-biomarker discovery).

Contents (HDF5 format):

snv_per_variant.h5 — per-variant SNV embeddings (1,921,403 variants × 1,169 dimensions) with full variant metadata (gene, locus, alleles, variant class, VAF, cohort).
cna_per_segment.h5 — per-segment CNA embeddings (1,823,050 segments × 688 dimensions) with full segment metadata (locus, segment mean, integer copy-number states, LOH).
per_sample_aggregated.h5 — per-sample mean and max pools (RobustScaler-normalized per modality) for 10,040 SNV-profiled and 10,742 CNA-profiled samples (9,694 with both modalities), including the scaler parameters and per-sample token counts.
README.md — complete file format, HDF5 schema, quick-start Python code, and citation.
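The per-sample pooling described for per_sample_aggregated.h5 can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline. It assumes that token embeddings (variants or segments) are pooled per sample first and that the pooled matrix is then scaled column-wise by median and interquartile range, which is the default behaviour of scikit-learn's RobustScaler. The function names and toy data below are hypothetical. For real analyses, the scaler parameters stored in the deposit should be used rather than refitted.

```python
import numpy as np


def pool_tokens(token_embeddings):
    """Mean- and max-pool an (n_tokens, n_dims) array into two (n_dims,) vectors."""
    tokens = np.asarray(token_embeddings, dtype=np.float64)
    return tokens.mean(axis=0), tokens.max(axis=0)


def fit_robust_scaler(matrix):
    """Return per-column median and interquartile range (25th to 75th percentile)."""
    matrix = np.asarray(matrix, dtype=np.float64)
    center = np.median(matrix, axis=0)
    q25, q75 = np.percentile(matrix, [25, 75], axis=0)
    scale = q75 - q25
    scale[scale == 0] = 1.0  # avoid division by zero for constant columns
    return center, scale


def apply_robust_scaler(matrix, center, scale):
    """Subtract the median and divide by the interquartile range, column-wise."""
    return (np.asarray(matrix, dtype=np.float64) - center) / scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in: three samples with different numbers of tokens, 4 dimensions each.
    samples = [rng.normal(size=(n_tokens, 4)) for n_tokens in (5, 12, 3)]

    mean_pools = np.stack([pool_tokens(s)[0] for s in samples])
    max_pools = np.stack([pool_tokens(s)[1] for s in samples])
    token_counts = [len(s) for s in samples]

    center, scale = fit_robust_scaler(mean_pools)
    scaled_mean_pools = apply_robust_scaler(mean_pools, center, scale)

    print("token counts:", token_counts)
    print("scaled mean-pool matrix shape:", scaled_mean_pools.shape)
    print("max-pool matrix shape:", max_pools.shape)
```

Keeping the per-sample token counts alongside the pools, as the deposit does, lets downstream code account for how many variants or segments contributed to each pooled vector.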

Source code: https://github.com/JW-Sidhom-Lab/tessera
Pretrained model weights: https://huggingface.co/JW-Sidhom-Lab/tessera-foundation
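The authoritative HDF5 schema is in the deposit's README.md and is not reproduced on this page. Until that is consulted, the files can be explored without assuming any dataset names. The sketch below uses h5py to list every dataset in a file with its shape and dtype, and to read a slice of rows without loading a multi-gigabyte array into memory. The helper names are hypothetical, and `dataset_name` must be a name reported by the listing or documented in the README.

```python
import sys

import h5py


def describe_hdf5(path):
    """Return a list of (name, shape, dtype) for every dataset in an HDF5 file."""
    entries = []

    def visit(name, obj):
        # visititems walks groups and datasets; only datasets are recorded.
        if isinstance(obj, h5py.Dataset):
            entries.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return entries


def read_rows(path, dataset_name, start, stop):
    """Read rows [start, stop) of one dataset; only that slice is loaded from disk."""
    with h5py.File(path, "r") as handle:
        return handle[dataset_name][start:stop]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python inspect_h5.py per_sample_aggregated.h5
    for name, shape, dtype in describe_hdf5(sys.argv[1]):
        print(f"{name}\t{shape}\t{dtype}")
```

Starting with per_sample_aggregated.h5 (129.1 MB) is a cheap way to check the layout before downloading the larger per-variant and per-segment files.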

Files (12.7 GB)

Name	MD5	Size
cna_per_segment.h5	fc593f8fc675242b7e2d4d4655268c2b	4.4 GB
per_sample_aggregated.h5	58d3333e2ea40aeeabb2e3ac2807a730	129.1 MB
README.md	0110e7ce96664b645496e75656a06353	9.7 kB
snv_per_variant.h5	a7cd8157f74e726aab858c2a2cd382d3	8.2 GB
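The MD5 digests in the file listing can be used to confirm that each download completed intact. A minimal sketch using only the Python standard library follows. The function names are hypothetical, and the files are assumed to sit in one local directory under their original names.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record.
EXPECTED_MD5 = {
    "cna_per_segment.h5": "fc593f8fc675242b7e2d4d4655268c2b",
    "per_sample_aggregated.h5": "58d3333e2ea40aeeabb2e3ac2807a730",
    "README.md": "0110e7ce96664b645496e75656a06353",
    "snv_per_variant.h5": "a7cd8157f74e726aab858c2a2cd382d3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-gigabyte files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each filename to True (match), False (mismatch), or None (file not found)."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == want) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {labels[ok]}")
```

A mismatch usually indicates a truncated or corrupted download, in which case the file should be fetched again before use.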

Additional details

Repository URL: https://github.com/JW-Sidhom-Lab/tessera
Programming language: Python
Development Status: Active
