Published January 9, 2026 | Version v1.0.1
Dataset Embargoed

uTILity: Collection of Tumor-Infiltrating Lymphocyte Single-Cell Experiments with TCR

Authors/Creators

  • 1. Washington University

Description

uTILity is a comprehensive, harmonized collection of publicly available single-cell RNA sequencing data from tumor-infiltrating T cells (TILs) with paired T cell receptor (TCR) sequencing. This resource aggregates data from 28 published studies spanning 13 tissue types, 420 unique patients, and over 2.6 million cells, of which 1.8 million have associated TCR information.

Data Processing

All datasets were uniformly processed using the following pipeline:

  1. Quality Control: Cells were excluded if they had >10% mitochondrial gene content and/or a number of detected features more than 2.5 standard deviations from the mean. Doublets were identified using scDblFinder.
  2. Annotation: Automated cell type annotation was performed using:
    • SingleR with Human Primary Cell Atlas (HPCA) and Monaco reference datasets
    • Azimuth with the PBMC reference (providing L1, L2, and L3 annotations)
  3. TCR Integration: T cell receptor data was processed using scRepertoire, with clonotypes assigned based on CDR3 amino acid sequences and gene usage.
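The quality-control thresholds in step 1 can be illustrated with a minimal sketch. It is not the pipeline's actual code (see the GitHub repository for that). It assumes a two-sided cutoff around the mean feature count and the sample standard deviation; the dictionary keys `percent_mt` and `n_features` and the function name `qc_pass` are hypothetical names chosen for illustration.

```python
from statistics import mean, stdev


def qc_pass(cells, max_mito_pct=10.0, n_sd=2.5):
    """Return the cells that pass both QC thresholds.

    `cells` is a list of dicts with the hypothetical keys
    'percent_mt' (mitochondrial percentage) and 'n_features'
    (number of detected genes). Requires at least two cells.
    """
    feature_counts = [c["n_features"] for c in cells]
    mu = mean(feature_counts)
    sd = stdev(feature_counts)  # assumption: sample SD, two-sided cutoff

    kept = []
    for c in cells:
        too_much_mito = c["percent_mt"] > max_mito_pct
        feature_outlier = abs(c["n_features"] - mu) > n_sd * sd
        # "and/or" in the description: failing either criterion excludes the cell
        if not (too_much_mito or feature_outlier):
            kept.append(c)
    return kept
```

Doublet detection with scDblFinder is a separate step and is not reproduced here.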

Contents

This archive contains:

  • Seurat Objects (.rds): Fully processed R objects with gene expression, cell metadata, dimensional reductions, and TCR annotations
  • AnnData Files (.h5ad): Python-compatible exports for use with scanpy, scvi-tools, and related ecosystems
  • Processed Data: Intermediate files and per-cohort objects for users who wish to work with individual studies

Cancer Types Represented

Breast, Colorectal, Lung, Melanoma, Renal, Ovarian, HNSCC, Esophageal, Biliary, Endometrial, Merkel Cell, and multi-cancer cohorts.

Tissue Types

Tumor, Normal adjacent tissue, Peripheral blood, Lymph node, Metastatic lesions, and Juxtatumoral tissue.

Usage

This data is intended for researchers studying tumor immunology, T cell biology, and computational methods for single-cell analysis. Users can leverage the harmonized annotations and TCR data for:

  • Pan-cancer T cell phenotype analysis
  • TCR repertoire studies across cancer types
  • Benchmarking integration and annotation methods
  • Training and validating machine learning models

For analysis code and the processing pipeline, see the associated GitHub repository.

File Formats

.h5ad: AnnData objects stored in HDF5 (Hierarchical Data Format), compatible with the Python single-cell ecosystem.

  • X: Raw count matrix (sparse CSR)
  • obs: Cell metadata
  • var: Gene metadata
  • obsm: Embeddings (PCA, UMAP, HARMONY, etc.)
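For readers unfamiliar with the sparse CSR layout used for X, the following is a small pure-Python sketch of the idea: only non-zero counts are stored, row by row (in AnnData, rows are cells and columns are genes). The function names are hypothetical and the toy matrix is made up; in practice the matrix is a `scipy.sparse.csr_matrix`, which exposes the same three arrays as `data`, `indices`, and `indptr`.

```python
def dense_to_csr(rows):
    """Convert a list of equal-length rows into CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero count itself
                indices.append(col)   # the column (gene) it belongs to
        indptr.append(len(data))      # where the next row (cell) starts
    return data, indices, indptr


def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


# Toy example: 3 cells x 4 genes, mostly zeros
counts = [[0, 3, 0, 0],
          [0, 0, 0, 0],
          [1, 0, 0, 2]]
data, indices, indptr = dense_to_csr(counts)
```

Because single-cell count matrices are mostly zeros, this layout keeps the files far smaller than a dense matrix would be.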

Load in Python with:

import scanpy as sc
adata = sc.read_h5ad("adata.h5ad")

Load in R with:

library(Seurat)
# Seurat objects are distributed as .rds files; replace the placeholder file name below
obj <- readRDS("seurat_object.rds")

Metadata Columns

See metadata_headers.txt in the GitHub repository for complete descriptions: https://github.com/ncborcherding/utility/blob/main/summary/metadata_headers.txt

Key columns:

  • orig.ident: Sample identifier (tumor type + tissue)
  • predicted.celltype.l1/l2/l3: Azimuth annotations
  • Monaco.labels / HPCA.labels: SingleR annotations
  • CTaa: Clonotype by CDR3 amino acid sequence
  • clonalFrequency: Clone count within sample
  • clonalProportion: Clone proportion within sample
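How the clonotype columns relate to one another can be sketched as follows. This is an illustration only, not the scRepertoire implementation: it assumes that clonalProportion is the clone count divided by the number of cells with TCR data in the same sample, and that cells without TCR data carry no CTaa value. The function name `add_clonal_stats` is hypothetical.

```python
from collections import Counter, defaultdict


def add_clonal_stats(cells):
    """Annotate each cell with clone count and proportion within its sample.

    `cells` is a list of dicts with 'orig.ident' and 'CTaa' keys;
    'CTaa' is None for cells without TCR data.
    """
    # Count cells per clonotype, separately for each sample
    per_sample = defaultdict(Counter)
    for c in cells:
        if c["CTaa"] is not None:
            per_sample[c["orig.ident"]][c["CTaa"]] += 1

    for c in cells:
        if c["CTaa"] is None:
            c["clonalFrequency"] = None
            c["clonalProportion"] = None
            continue
        sample_counts = per_sample[c["orig.ident"]]
        total_with_tcr = sum(sample_counts.values())
        c["clonalFrequency"] = sample_counts[c["CTaa"]]
        # assumption: denominator is cells with TCR data in the same sample
        c["clonalProportion"] = sample_counts[c["CTaa"]] / total_with_tcr
    return cells
```

With this definition the same CDR3 amino acid sequence found in two samples is counted independently in each, which matches the "within sample" wording of the column descriptions.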

Suggested Citation Format

Borcherding, N. (2025). uTILity: Comprehensive Single-Cell Tumor-Infiltrating Lymphocyte Data with Paired TCR Sequencing (Version 1.0.0) [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.10211240

Files

Embargoed

The files will be made publicly available on December 11, 2026.

Reason: Finalizing and publishing data

Additional details

Dates

Available
2025-12-11

Software

Repository URL
https://github.com/ncborcherding/utility
Programming language
R, Python