uTILity: Collection of Tumor-Infiltrating Lymphocyte Single-Cell Experiments with TCR
Description
uTILity is a comprehensive, harmonized collection of publicly available single-cell RNA sequencing data from tumor-infiltrating T cells (TILs) with paired T cell receptor (TCR) sequencing. This resource aggregates data from 28 published studies spanning 13 tissue types, 420 unique patients, and over 2.6 million cells, with 1.8 million cells having associated TCR information.
Data Processing
All datasets were uniformly processed using the following pipeline:
- Quality Control: Cells with >10% mitochondrial gene content and/or a feature count more than 2.5 standard deviations from the mean were excluded. Doublets were identified using scDblFinder.
- Annotation: Automated cell type annotation was performed using:
  - SingleR with Human Primary Cell Atlas (HPCA) and Monaco reference datasets
  - Azimuth with the PBMC reference (providing L1, L2, and L3 annotations)
- TCR Integration: T cell receptor data was processed using scRepertoire, with clonotypes assigned based on CDR3 amino acid sequences and gene usage.
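The quality-control thresholds listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the pipeline's actual code (which lives in the GitHub repository): the function name `qc_mask`, its inputs, and the use of the population standard deviation are hypothetical choices, and the source does not state whether the thresholds were applied per sample or across all cells.

```python
from statistics import mean, pstdev


def qc_mask(n_features, pct_mito, max_pct_mito=10.0, n_sd=2.5):
    """Return one boolean per cell: True if the cell passes both QC filters.

    n_features: number of detected genes (features) per cell
    pct_mito:   percentage of counts from mitochondrial genes per cell
    """
    if len(n_features) != len(pct_mito):
        raise ValueError("inputs must have one entry per cell")
    mu = mean(n_features)
    sd = pstdev(n_features)  # assumption: population SD; the source does not specify
    lower, upper = mu - n_sd * sd, mu + n_sd * sd
    return [
        # keep cells at or below the mitochondrial cutoff and within the SD band
        pct <= max_pct_mito and lower <= n <= upper
        for n, pct in zip(n_features, pct_mito)
    ]
```

With real data, the two inputs would come from per-cell metadata columns holding the feature count and mitochondrial percentage; doublet detection with scDblFinder is a separate step and is not reproduced here.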
Contents
This archive contains:
- Seurat Objects (.rds): Fully processed R objects with gene expression, cell metadata, dimensional reductions, and TCR annotations
- AnnData Files (.h5ad): Python-compatible exports for use with scanpy, scvi-tools, and related ecosystems
- Processed Data: Intermediate files and per-cohort objects for users who wish to work with individual studies
Cancer Types Represented
Breast, Colorectal, Lung, Melanoma, Renal, Ovarian, HNSCC, Esophageal, Biliary, Endometrial, Merkel Cell, and multi-cancer cohorts.
Tissue Types
Tumor, Normal adjacent tissue, Peripheral blood, Lymph node, Metastatic lesions, and Juxtatumoral tissue.
Usage
This data is intended for researchers studying tumor immunology, T cell biology, and computational methods for single-cell analysis. Users can leverage the harmonized annotations and TCR data for:
- Pan-cancer T cell phenotype analysis
- TCR repertoire studies across cancer types
- Benchmarking integration and annotation methods
- Training and validating machine learning models
For analysis code and the processing pipeline, see the associated GitHub repository.
File Formats
.h5ad: AnnData objects stored in HDF5 (Hierarchical Data Format), compatible with the Python single-cell ecosystem.
- X: Raw count matrix (sparse CSR)
- obs: Cell metadata
- var: Gene metadata
- obsm: Embeddings (PCA, UMAP, HARMONY, etc.)
Load in Python with:
import scanpy as sc
adata = sc.read_h5ad("adata.h5ad")
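Load the Seurat objects (.rds) in R with:
library(Seurat)
# Placeholder file name: substitute one of the .rds files from the archive.
# readRDS() reads .rds files only; it cannot read .h5ad files.
obj <- readRDS("seurat_object.rds")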
Metadata Columns
See metadata_headers.txt in the GitHub repository for complete descriptions: https://github.com/ncborcherding/utility/blob/main/summary/metadata_headers.txt
Key columns:
- orig.ident: Sample identifier (tumor type + tissue)
- predicted.celltype.l1/l2/l3: Azimuth annotations
- Monaco.labels / HPCA.labels: SingleR annotations
- CTaa: Clonotype by CDR3 amino acid sequence
- clonalFrequency: Clone count within sample
- clonalProportion: Clone proportion within sample
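To make the relationship between CTaa, clonalFrequency, and clonalProportion concrete, here is a minimal sketch of how per-sample clone counts and proportions could be derived from a clonotype column. It is not scRepertoire's implementation: the function name `clonal_stats` is hypothetical, and the choice to compute proportions relative to cells with TCR information in the same sample is an assumption.

```python
from collections import Counter, defaultdict


def clonal_stats(samples, clonotypes):
    """Return (frequency, proportion) lists with one entry per cell.

    samples:    sample identifier per cell (e.g. values of orig.ident)
    clonotypes: CDR3 amino acid clonotype per cell (e.g. values of CTaa),
                or None for cells without TCR information
    """
    # Count how many cells carry each clonotype within each sample.
    counts = defaultdict(Counter)
    for sample, clone in zip(samples, clonotypes):
        if clone is not None:
            counts[sample][clone] += 1
    # Assumption: the denominator is the number of TCR-bearing cells per sample.
    totals = {s: sum(c.values()) for s, c in counts.items()}

    freq, prop = [], []
    for sample, clone in zip(samples, clonotypes):
        if clone is None:
            freq.append(None)
            prop.append(None)
        else:
            n = counts[sample][clone]
            freq.append(n)
            prop.append(n / totals[sample])
    return freq, prop
```

When working from a loaded AnnData object, the inputs would be the corresponding `obs` columns; missing values there are typically stored as NaN rather than None, so they would need to be converted first.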
Suggested Citation
Borcherding, N. (2025). uTILity: Comprehensive Single-Cell Tumor-Infiltrating Lymphocyte Data with Paired TCR Sequencing (Version 1.0.0) [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.10211240
Additional details
Dates
- Available: 2025-12-11
Software
- Repository URL: https://github.com/ncborcherding/utility
- Programming language: R, Python