Supplementary Data for "MetaTCR: A Framework for Analyzing Batch Effects in TCR Repertoire Datasets"
Description
Project Overview
This repository contains the foundational reference data and processed datasets associated with MetaTCR, a computational framework designed to standardize T-cell Receptor (TCR) repertoires and mitigate batch effects in Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data.
MetaTCR addresses the challenge of non-biological variation by constructing a population-scale "Referenced TCR Space." This allows raw TCR repertoires to be converted into fixed-dimensional feature profiles (meta-vectors), enabling robust cross-study comparison and integration. The data provided here allows researchers to reproduce the study's benchmarking results, utilize the pre-trained reference space for new data, and explore the batch correction capabilities of the framework.
Dataset Structure and Contents
The dataset is organized into a main directory named data, which contains four primary subdirectories corresponding to different data types: reference databases, metadata, processed matrices (MetaTCR intermediate results), and antigen-specific data.
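Before running any analysis, it can help to confirm that the download unpacked into the layout described above. The following is a minimal sketch using only the Python standard library. The function name `check_layout` and its `root` argument are illustrative and not part of MetaTCR; the subdirectory names are the ones listed in this description.

```python
from pathlib import Path

# Subdirectory names as listed in this description, relative to the dataset root.
EXPECTED_SUBDIRS = ("database", "metadata", "processed_data", "tcr_antigen_data")


def check_layout(root="data"):
    """Return a dict mapping each expected subdirectory name to True/False."""
    root = Path(root)
    return {name: (root / name).is_dir() for name in EXPECTED_SUBDIRS}
```

Calling `check_layout("data")` from the directory that contains the unpacked data folder should report `True` for all four entries if the layout matches.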
1. data/database/
This folder contains the core reference files and pre-computed embeddings used for the analysis.
- TCR_reference_database.full_legnth.txt: A collection of raw TCR clonotypes assembled from CDR3, TRBV, TRBD, and TRBJ segments. These clonotypes represent a merged and deduplicated set of representative TCRs derived from various datasets.
2. data/metadata/
This folder contains clinical and experimental metadata.
- datasets_platform_info.csv: A summary file detailing the sequencing platform and the immune repertoire bioinformatics processing pipeline tag for each PBMC dataset.
- Cohort-specific CSV files (e.g., Dewitt2015.csv, Emerson2017.csv): Each file contains the study-specific clinical variables and sample metadata for one cohort.
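Because the clinical variables differ from study to study, a useful first step is to list which columns each cohort file provides. The sketch below assumes the files are ordinary comma-separated text with a header row, which is not confirmed by this description. The function name `cohort_columns` is hypothetical; only the file name datasets_platform_info.csv comes from the listing above.

```python
import csv
from pathlib import Path

PLATFORM_FILE = "datasets_platform_info.csv"  # file name taken from this description


def cohort_columns(metadata_dir="data/metadata"):
    """Map each cohort CSV (by file stem) to the list of its header columns."""
    directory = Path(metadata_dir)
    if not directory.is_dir():
        return {}
    columns = {}
    for path in sorted(directory.glob("*.csv")):
        if path.name == PLATFORM_FILE:
            continue  # platform summary, not a cohort file
        # Assumption: UTF-8 text; utf-8-sig also tolerates a leading byte-order mark.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            header = next(csv.reader(handle), [])
        columns[path.stem] = header
    return columns
```

Comparing the returned headers across cohorts shows which variables are shared and can be harmonized before any cross-study analysis.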
3. data/processed_data/
This folder contains the intermediate results of the MetaTCR pipeline, organized into cluster information and feature matrices.
- cluster_centroids/: Contains data related to the clustering of TCR sequences.
  - 1024_primary_centroids.pk: The coordinates of the cluster centroids (k=1024).
  - 1024_primary_labels.pk: The assigned labels for the primary clustering.
  - centroid_mapping_spectral_k96.pk: The mapping file for spectral clustering or dimensionality reduction (k=96).
- primary_metatcr_mtx/: Contains the processed MetaTCR matrices for each dataset.
  - [StudyName].pk (e.g., Emerson2017-HIP.pk, TRACERx.pk, Snyder2017.pk): These pickle files store the processed MetaTCR matrix for each cohort, representing the quantified TCR features across samples.
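The internal structure of the .pk files is not documented in this description, so the sketch below only loads a file and reports a type-agnostic summary instead of assuming a particular layout. The helper names `load_pickle` and `describe` are illustrative. Two caveats: Python's pickle module can execute arbitrary code while loading, so only unpickle files obtained from a trusted source, and loading may require the libraries that were used to create the objects (for example NumPy, which is an assumption here).

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Unpickle one file. Only use this on files from a source you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def describe(obj):
    """Return a short summary of an unpickled object without assuming its type."""
    summary = {"type": type(obj).__name__}
    if hasattr(obj, "shape"):        # array-like objects expose a shape
        summary["shape"] = tuple(obj.shape)
    elif hasattr(obj, "__len__"):    # dict, list, tuple, ...
        summary["length"] = len(obj)
    if isinstance(obj, dict):
        summary["first_keys"] = list(obj)[:5]
    return summary
```

For instance, `describe(load_pickle("data/processed_data/primary_metatcr_mtx/TRACERx.pk"))` would show whether that file holds an array, a dictionary, or another container before any further processing is attempted.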
4. data/tcr_antigen_data/
This folder contains ground-truth data linking TCR sequences to specific antigens.
- McPAS-TCR_filt_ept_full_deduplicated.tsv: A filtered and deduplicated version of the McPAS-TCR database, mapping TCRs to their known epitopes and associated pathologies.
- antigen_vj_vdjdb_full.tsv: A comprehensive dataset from VDJdb, containing V/J gene usage and antigen specificity information.
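A quick way to get oriented in the antigen tables is to count how many rows fall under each value of a chosen column, such as the epitope or pathology column. The sketch below assumes tab-separated text with a header row, as the .tsv extension suggests. The function name `count_by_column` is hypothetical, the column names are not given in this description and must be read from each file's header, and the default encoding is an assumption that may need changing.

```python
import csv
from collections import Counter


def count_by_column(tsv_path, column, encoding="utf-8-sig"):
    """Count rows per distinct value of `column` in a tab-separated file."""
    counts = Counter()
    with open(tsv_path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if column not in (reader.fieldnames or []):
            raise KeyError(f"{column!r} not in header: {reader.fieldnames}")
        for row in reader:
            counts[row[column]] += 1
    return counts
```

The returned `Counter` supports `most_common(n)`, which lists the n most frequent values in the chosen column.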