Supplementary Data for "MetaTCR: A Framework for Analyzing Batch Effects in TCR Repertoire Datasets"

HUO, Miaozhe

doi:10.5281/zenodo.18265157

Published January 16, 2026 | Version v1

Dataset Open

HUO, Miaozhe (Data manager)¹

1. City University of Hong Kong

Project Overview
This repository contains the foundational reference data and processed datasets associated with MetaTCR, a computational framework designed to standardize T-cell receptor (TCR) repertoires and mitigate batch effects in Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data.

MetaTCR addresses non-biological variation by constructing a population-scale "Referenced TCR Space." Raw TCR repertoires can then be converted into fixed-dimensional feature profiles (meta-vectors), enabling robust cross-study comparison and integration. With the data provided here, researchers can reproduce the study's benchmarking results, apply the pre-trained reference space to new data, and explore the framework's batch correction capabilities.
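To make the idea of a fixed-dimensional profile concrete, the following is a minimal sketch, not the MetaTCR implementation. It assumes that each TCR has already been embedded as a numeric vector, that the reference space is summarized by a list of centroids, and that a repertoire is profiled by the fraction of its TCRs falling nearest to each centroid. The function names, the Euclidean distance, and the toy coordinates are all illustrative assumptions.

```python
import math
from collections import Counter


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def meta_vector(embeddings, centroids):
    """Map a variable-size set of embeddings to a fixed-length frequency profile.

    Hypothetical helper: the real framework may weight clones or use a
    different distance; this only illustrates the fixed-dimension idea.
    """
    counts = Counter(nearest_centroid(p, centroids) for p in embeddings)
    total = len(embeddings)
    return [counts[i] / total if total else 0.0 for i in range(len(centroids))]


# Toy reference space with three centroids (made-up coordinates).
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

# Two toy repertoires of different sizes.
repertoire_a = [(0.1, 0.2), (9.5, 0.3), (9.9, -0.1), (0.2, 9.7)]
repertoire_b = [(0.0, 0.1), (0.3, 0.2)]

print(meta_vector(repertoire_a, centroids))  # [0.25, 0.5, 0.25]
print(meta_vector(repertoire_b, centroids))  # [1.0, 0.0, 0.0]
```

Both repertoires end up as vectors of length three regardless of how many sequences they contain, which is what makes samples from different studies directly comparable.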

Dataset Structure and Contents

The dataset is organized into a main directory named `data`, which contains four subdirectories, one per data type: reference databases, metadata, processed matrices (MetaTCR intermediate results), and antigen-specific data.

1. `data/database/`

This folder contains the core reference files and pre-computed embeddings used for the analysis.

TCR_reference_database.full_legnth.txt: A collection of raw TCR clonotypes assembled from CDR3, TRBV, TRBD, and TRBJ segments. These clonotypes represent a merged and deduplicated set of representative TCRs derived from various datasets.

2. `data/metadata/`

This folder contains clinical and experimental metadata.

datasets_platform_info.csv: A summary file listing the sequencing platform and the immune repertoire processing pipeline tag for each PBMC dataset.
Cohort-specific CSV files (e.g., Dewitt2015.csv, Emerson2017.csv): Each file contains the study-specific clinical variables and sample metadata for one cohort.

3. `data/processed_data/`

This folder contains the intermediate results of the MetaTCR pipeline, organized into cluster information and feature matrices.

cluster_centroids/: Contains data related to the clustering of TCR sequences.
- 1024_primary_centroids.pk: The coordinates of the cluster centroids (k=1024).
- 1024_primary_labels.pk: The assigned labels for the primary clustering.
- centroid_mapping_spectral_k96.pk: The mapping file for spectral clustering or dimensionality reduction (k=96).
primary_metatcr_mtx/: Contains the processed MetaTCR matrices for each dataset.
- [StudyName].pk (e.g., Emerson2017-HIP.pk, TRACERx.pk, Snyder2017.pk): Pickle files, one per cohort, storing the processed MetaTCR matrix that quantifies TCR features across that cohort's samples.
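Because the internal layout of the `.pk` files is not described here, a cautious first step is to load one and inspect its type and size before using it. The sketch below uses only Python's standard `pickle` module; `describe_pickle` is a hypothetical helper name, and the example path assumes the archive was unpacked into the `data/` layout described above. If the files hold objects from third-party libraries (NumPy, pandas, or similar, which is an assumption), those libraries must be installed for loading to succeed. As with any pickle, load only files obtained from a source you trust.

```python
import pickle
from pathlib import Path


def describe_pickle(path):
    """Load a pickle file and return (object, short summary of its type and size)."""
    with Path(path).open("rb") as fh:
        obj = pickle.load(fh)

    shape = getattr(obj, "shape", None)  # present on array-like objects
    if shape is not None:
        summary = f"{type(obj).__name__} with shape {tuple(shape)}"
    elif isinstance(obj, dict):
        summary = f"dict with {len(obj)} keys, e.g. {list(obj)[:3]}"
    elif hasattr(obj, "__len__"):
        summary = f"{type(obj).__name__} of length {len(obj)}"
    else:
        summary = type(obj).__name__
    return obj, summary


# Assumed location after unpacking processed_data.zip; skipped if absent.
example = Path("data/processed_data/cluster_centroids/1024_primary_centroids.pk")
if example.exists():
    _, summary = describe_pickle(example)
    print(summary)
```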

4. `data/tcr_antigen_data/`

This folder contains ground-truth data linking TCR sequences to specific antigens.

McPAS-TCR_filt_ept_full_deduplicated.tsv: A filtered and deduplicated version of the McPAS-TCR database, mapping TCRs to their known epitopes and associated pathologies.
antigen_vj_vdjdb_full.tsv: A comprehensive dataset from VDJdb, containing V/J gene usage and antigen specificity information.
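The column names of these tables are not listed in this description, so it helps to read the header and a few rows before writing any analysis code. The following sketch uses the standard `csv` module; `peek_table` is a hypothetical helper, and the UTF-8 encoding is an assumption that may need adjusting for a particular file.

```python
import csv


def peek_table(path, delimiter="\t", n_rows=3, encoding="utf-8"):
    """Return (column names, first n_rows rows as dicts) of a delimited text file."""
    with open(path, newline="", encoding=encoding) as fh:
        reader = csv.DictReader(fh, delimiter=delimiter)
        rows = [row for _, row in zip(range(n_rows), reader)]
        return reader.fieldnames, rows
```

Calling `peek_table("data/tcr_antigen_data/antigen_vj_vdjdb_full.tsv")` would return the header and the first three records; passing `delimiter=","` makes the same helper usable for the CSV files under `data/metadata/`.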

Files

Files (120.3 MB)

Name	MD5	Size
database.zip	a088944c11c7284056f58e5d9710a000	37.7 MB
metadata.zip	0394dfda5e5dfdcb8e3d87478a5f3ab4	10.0 kB
processed_data.zip	7aa0da21dd0cace725905ce3a7390306	82.4 MB
tcr_antigen_data.zip	ea8dadb3d27db296109f3f06c97bb53c	171.9 kB
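The MD5 checksums listed above can be used to confirm that each archive downloaded intact. A minimal sketch with the standard `hashlib` module follows; the helper names `md5_of_file` and `verify_download` are illustrative, and the files are assumed to keep their original names after download.

```python
import hashlib
import os

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "database.zip": "a088944c11c7284056f58e5d9710a000",
    "metadata.zip": "0394dfda5e5dfdcb8e3d87478a5f3ab4",
    "processed_data.zip": "7aa0da21dd0cace725905ce3a7390306",
    "tcr_antigen_data.zip": "ea8dadb3d27db296109f3f06c97bb53c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the listed value for its file name."""
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected


# Check whichever archives are present in the current directory.
for name in EXPECTED_MD5:
    if os.path.exists(name):
        print(name, "OK" if verify_download(name) else "MISMATCH")
```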

