Published January 16, 2026 | Version v1
Dataset Open

Supplementary Data for "MetaTCR: A Framework for Analyzing Batch Effects in TCR Repertoire Datasets"

  • 1. City University of Hong Kong

Description

Project Overview
This repository contains the foundational reference data and processed datasets associated with MetaTCR, a computational framework designed to standardize T-cell receptor (TCR) repertoires and mitigate batch effects in Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data.

MetaTCR addresses the challenge of non-biological variation by constructing a population-scale "Referenced TCR Space." This allows raw TCR repertoires to be converted into fixed-dimensional feature profiles (meta-vectors), enabling robust cross-study comparison and integration. The data provided here allows researchers to reproduce the study's benchmarking results, utilize the pre-trained reference space for new data, and explore the batch correction capabilities of the framework.
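
To make the idea of a fixed-dimensional profile concrete, the following is a minimal sketch and not MetaTCR's actual implementation. It assumes each TCR has already been embedded as a numeric vector, assigns every TCR to its nearest centroid, and reports the fraction of the repertoire that falls into each cluster. The function name `meta_vector`, the toy 2-D embeddings, and the use of Euclidean distance are all illustrative assumptions.

```python
import math


def meta_vector(embeddings, centroids):
    """Return the fraction of TCR embeddings assigned to each centroid.

    Hypothetical illustration only: the real MetaTCR feature definition
    may differ in distance metric, weighting, and normalization.
    """
    if not embeddings:
        raise ValueError("repertoire is empty")
    counts = [0] * len(centroids)
    for vec in embeddings:
        # Index of the closest centroid by Euclidean distance.
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(vec, centroids[i]))
        counts[nearest] += 1
    total = len(embeddings)
    return [count / total for count in counts]


# Toy data: three centroids and a four-TCR repertoire in a 2-D space.
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
repertoire = [(0.1, 0.2), (9.5, 10.1), (0.3, -0.1), (9.9, 9.7)]
profile = meta_vector(repertoire, centroids)
print(profile)  # [0.5, 0.5, 0.0]
```

The output length depends only on the number of centroids, not on the repertoire size, which is what makes profiles from differently sized repertoires directly comparable.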

Dataset Structure and Contents

The dataset is organized into a main directory named data, which contains four primary subdirectories corresponding to different data types: reference databases, metadata, processed matrices (MetaTCR intermediate results), and antigen-specific data.

1. data/database/

This folder contains the core reference files and pre-computed embeddings used for the analysis.

  • TCR_reference_database.full_legnth.txt: A collection of raw TCR clonotypes assembled from CDR3, TRBV, TRBD, and TRBJ segments. These clonotypes represent a merged and deduplicated set of representative TCRs derived from various datasets.

2. data/metadata/

This folder contains clinical and experimental metadata.

  • datasets_platform_info.csv: A summary file detailing the sequencing platforms and immune repertoire bioinformatics processing pipeline tags for all PBMC datasets.
  • Cohort-specific CSV files (e.g., Dewitt2015.csv, Emerson2017.csv, etc.): These files contain study-specific clinical variables and sample metadata corresponding to each cohort.
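
Because the column names differ between cohorts and are not listed here, a reasonable first step is to survey what each metadata file contains. The sketch below uses only the standard library and assumes the archive has been extracted so that a data/metadata/ directory exists and that the files are UTF-8 encoded; the helper name `summarize_csvs` is hypothetical.

```python
import csv
from pathlib import Path


def summarize_csvs(metadata_dir):
    """Map each CSV file name to (column names, number of data rows)."""
    summary = {}
    for path in sorted(Path(metadata_dir).glob("*.csv")):
        # UTF-8 is an assumption; adjust if a file uses another encoding.
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])        # first row holds the column names
            n_rows = sum(1 for _ in reader)  # remaining rows are samples
        summary[path.name] = (header, n_rows)
    return summary


# Usage, assuming data/metadata/ exists in the working directory:
# for name, (columns, n_rows) in summarize_csvs("data/metadata").items():
#     print(name, n_rows, columns)
```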

3. data/processed_data/

This folder contains the intermediate results of the MetaTCR pipeline, organized into cluster information and feature matrices.

  • cluster_centroids/: Contains data related to the clustering of TCR sequences.

    • 1024_primary_centroids.pk: The coordinates of the cluster centroids (k=1024).
    • 1024_primary_labels.pk: The assigned labels for the primary clustering.
    • centroid_mapping_spectral_k96.pk: The mapping file for spectral clustering or dimensionality reduction (k=96).
  • primary_metatcr_mtx/: Contains the processed MetaTCR matrices for each dataset.

    • [StudyName].pk (e.g., Emerson2017-HIP.pk, TRACERx.pk, Snyder2017.pk): These pickle files store the processed MetaTCR matrices for each cohort, representing the quantified TCR features across samples.
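
The internal layout of the .pk files is not documented here, so it helps to inspect an object before relying on it. The following sketch loads a pickle file and reports its type and size; the helper name `describe_pickle` is hypothetical. Two caveats apply: unpickling can execute arbitrary code, so only load files obtained from a trusted source, and if the stored objects are NumPy arrays or pandas DataFrames, those libraries must be installed for loading to succeed.

```python
import pickle
from pathlib import Path


def describe_pickle(path):
    """Load a pickle file and return (object, short type/size description)."""
    with Path(path).open("rb") as handle:
        obj = pickle.load(handle)  # only unpickle files from a trusted source
    shape = getattr(obj, "shape", None)
    if shape is not None:
        # Array-like objects (e.g. NumPy arrays) expose a shape attribute.
        detail = f"shape={tuple(shape)}"
    elif hasattr(obj, "__len__"):
        detail = f"len={len(obj)}"
    else:
        detail = "no length"
    return obj, f"{type(obj).__name__} ({detail})"


# Usage, assuming the archive was extracted under data/:
# obj, info = describe_pickle("data/processed_data/primary_metatcr_mtx/TRACERx.pk")
# print(info)
```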

4. data/tcr_antigen_data/

This folder contains ground-truth data linking TCR sequences to specific antigens.

  • McPAS-TCR_filt_ept_full_deduplicated.tsv: A filtered and deduplicated version of the McPAS-TCR database, mapping TCRs to their known epitopes and associated pathologies.
  • antigen_vj_vdjdb_full.tsv: A comprehensive dataset from VDJdb, containing V/J gene usage and antigen specificity information.

 

Files

database.zip

Files (120.3 MB)

Checksum Size
md5:a088944c11c7284056f58e5d9710a000 37.7 MB
md5:0394dfda5e5dfdcb8e3d87478a5f3ab4 10.0 kB
md5:7aa0da21dd0cace725905ce3a7390306 82.4 MB
md5:ea8dadb3d27db296109f3f06c97bb53c 171.9 kB
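
The MD5 checksums in the file listing can be used to confirm that a download completed without corruption. The sketch below computes a file's MD5 digest in chunks, so that large archives do not need to fit in memory, and compares it with a listed value. The helper names `md5_of_file` and `matches_listing` are hypothetical, and the pairing of each checksum with a file name is not shown in the listing, so it must be taken from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage, with the checksum copied from the listing for that file:
# print(matches_listing("database.zip", "md5:<checksum from the listing>"))
```

MD5 is suitable here for detecting accidental corruption; it is not a safeguard against deliberate tampering.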