Robust detection of SARS-CoV-2 exposure in population using T-cell repertoire profiling
Description
The dataset contains processed T-cell receptor repertoire sequencing data from >1200 individuals of different sex and age. Note that only samples with good sequencing coverage are published (>10^5 reads per file).
The main aim of our study is to find TCR sequence biomarkers and to develop a bioinformatic pipeline for building an accurate and robust classifier that distinguishes COVID-19-convalescent donors from unexposed individuals. We performed immunosequencing of the rearranged TCR α and β regions for PBMCs. For the cohort described in this study (Cohort-I), we sequenced both chains of the TCR heterodimer, as both chains are required to properly predict antigen recognition [26]. We ran conventional T-cell repertoire data analysis and pre-processed the data to remove low-coverage samples.
Of the Cohort-I samples that passed the read-count threshold, 383/377 TCR α/β samples were from healthy donors (SARS-CoV-2 PCR test negative or collected before the pandemic) and 890/848 were from COVID-19-positive patients. Most samples are accompanied by information on HLA class I and II alleles. Samples were prepared and sequenced in nine batches.
The metadata for both TCR alpha and beta repertoires contains the following information:
- sequencing_date - date when sequencing was performed
- batch_name - one of the 9 unique batch identifiers
- sample_id, patient_id - information on sample identifier and donor identifier
- COVID_status, COVID_IgG, COVID_IgM, COVID_PCR - information on COVID-19 status
- HLA-A.1, HLA-A.2, HLA-B.1, HLA-B.2, HLA-C.1, HLA-C.2 - MHC class I alleles
- HLA-DPB1.1, HLA-DPB1.2, HLA-DQB1.1, HLA-DQB1.2, HLA-DRB1.1, HLA-DRB1.2 - MHC class II alleles
- file_name - name of the corresponding file in fmba_clonotype_usage_tables.zip archive
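As an illustration of how these metadata columns might be used, the sketch below groups repertoire file names by batch and COVID-19 status. It is a minimal example under stated assumptions: the comma-separated layout, the status labels (`COVID`, `healthy`), the batch names, and the file names are all invented for the toy input and may not match the published metadata.

```python
import csv
import io
from collections import defaultdict

# Toy metadata. Column names come from the list above;
# every value (labels, batch names, file names) is invented for illustration.
TOY_METADATA = """sample_id,patient_id,batch_name,COVID_status,file_name
s1,p1,batch_A,COVID,s1.clonotypes.txt
s2,p2,batch_A,healthy,s2.clonotypes.txt
s3,p3,batch_B,COVID,s3.clonotypes.txt
"""


def files_by_batch_and_status(metadata_text, delimiter=","):
    """Group repertoire file names by (batch_name, COVID_status)."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter=delimiter)
    for row in reader:
        key = (row["batch_name"], row["COVID_status"])
        groups[key].append(row["file_name"])
    return dict(groups)


groups = files_by_batch_and_status(TOY_METADATA)
for (batch, status), files in sorted(groups.items()):
    print(batch, status, files)
```

Grouping by batch first makes it easy to check that each batch contains both exposed and unexposed donors before it is used for training or validation.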
Each file in the fmba_clonotype_usage_tables.zip archive stores either a TCR alpha or a TCR beta repertoire. Each line in a file corresponds to a unique clonotype, and each clonotype is accompanied by the following information:
- count - number of reads where the clonotype was detected
- freq - number of reads with the clonotype divided by the total number of reads in the sample
- cdr3nt, cdr3aa - nucleotide and amino acid sequences of TCR's CDR3 sequence
- v, d, j - names of the V, D, and J segments used in the clonotype's rearrangement
- VEnd, DStart, DEnd, JStart - information on VDJ junction positions
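The relationship between count and freq can be checked directly once a table is loaded. The sketch below parses a toy clonotype table and verifies that each freq equals count divided by the total read count of the sample. The tab-separated layout and all values are assumptions made for the example, and the junction position columns are omitted for brevity.

```python
import csv
import io

# Toy clonotype table. Column names follow the list above;
# the tab-separated layout and all values are assumptions.
TOY_TABLE = (
    "count\tfreq\tcdr3nt\tcdr3aa\tv\td\tj\n"
    "6\t0.6\tTGTGCCAGCAGTTTC\tCASSF\tTRBV5-1\tTRBD1\tTRBJ2-7\n"
    "4\t0.4\tTGTGCCAGCAGTTAC\tCASSY\tTRBV7-9\tTRBD2\tTRBJ1-1\n"
)


def read_clonotypes(text):
    """Parse a clonotype table into a list of dicts with typed count and freq."""
    clonotypes = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["count"] = int(row["count"])
        row["freq"] = float(row["freq"])
        clonotypes.append(row)
    return clonotypes


def freqs_are_consistent(clonotypes, tol=1e-6):
    """Check that every freq equals count / total reads, within a tolerance."""
    total = sum(c["count"] for c in clonotypes)
    return all(abs(c["freq"] - c["count"] / total) <= tol for c in clonotypes)


clonotypes = read_clonotypes(TOY_TABLE)
total_reads = sum(c["count"] for c in clonotypes)
print(total_reads, freqs_are_consistent(clonotypes))
```

The same total read count is what the coverage threshold (>10^5 reads per file) refers to, so a check of this kind can also confirm that a downloaded file meets it.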
We then select a set of CDR3 sequences that can serve as biomarkers and form the feature list for a COVID-19 status classifier. We also validate the resulting set of clonotypes in several ways. Co-occurrence of specific TCR α and β clonotypes can serve as an independent validation of the biomarkers and of their association with a specific pathogen. Information on donor HLAs is provided so that the set of biomarkers can be filtered by HLA restriction: association with a donor HLA is additional evidence that a TCR is specific to a set of antigens presented in that donor, and it allows detecting the fingerprint of past and present infection. Furthermore, clonotypes with similar sequences can be aggregated into 'metaclonotype' biomarkers based on clonotype graph analysis.
Finally, we train various COVID-19 status classifiers on selected batches of Cohort-I data, using different algorithms and different feature sets, and verify the robustness of our results on independent batches of Cohort-I and on previously published Cohort-II data.
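One common way to aggregate similar clonotypes is to build a graph in which two CDR3 amino acid sequences are connected when they differ by a single substitution, and then to treat each connected component as one group. The sketch below illustrates that idea only. The one-mismatch Hamming criterion and the function names are assumptions for the example and are not taken from the description above.

```python
from collections import defaultdict


def is_one_mismatch(a, b):
    """True if two equal-length sequences differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def cluster_cdr3(cdr3_list):
    """Group CDR3 sequences into connected components of a one-mismatch graph."""
    seqs = sorted(set(cdr3_list))

    # Build the adjacency lists by comparing all pairs (fine for small sets).
    adjacency = defaultdict(set)
    for i, a in enumerate(seqs):
        for b in seqs[i + 1:]:
            if is_one_mismatch(a, b):
                adjacency[a].add(b)
                adjacency[b].add(a)

    # Depth-first search to collect connected components.
    seen = set()
    clusters = []
    for start in seqs:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


clusters = cluster_cdr3(["CASSF", "CASSY", "CASSL", "CAVRDF"])
print(clusters)
```

The all-pairs comparison is quadratic in the number of sequences, so a real biomarker set of many thousands of clonotypes would likely need an indexed neighbor search instead.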
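Holding out whole batches, rather than random samples, is what makes such a validation sensitive to batch effects. The sketch below shows one way to split sample records by batch_name and to check that no donor appears on both sides of the split. The record layout and batch names are hypothetical, and the classifier itself is left out.

```python
def split_by_batch(samples, held_out_batches):
    """Split sample records into training and held-out sets by batch_name."""
    held_out = set(held_out_batches)
    train = [s for s in samples if s["batch_name"] not in held_out]
    test = [s for s in samples if s["batch_name"] in held_out]
    return train, test


def shared_patients(train, test):
    """Return patient_id values that occur in both sets (ideally none)."""
    return {s["patient_id"] for s in train} & {s["patient_id"] for s in test}


# Toy records; field names follow the metadata columns, values are invented.
samples = [
    {"sample_id": "s1", "patient_id": "p1", "batch_name": "batch_A"},
    {"sample_id": "s2", "patient_id": "p2", "batch_name": "batch_A"},
    {"sample_id": "s3", "patient_id": "p3", "batch_name": "batch_B"},
    {"sample_id": "s4", "patient_id": "p4", "batch_name": "batch_C"},
]

train, test = split_by_batch(samples, held_out_batches=["batch_C"])
print(len(train), len(test), shared_patients(train, test))
```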
Files
| Name | Size |
|---|---|
| README.md (md5:a99e0d0e68b8024b18b01abe1f1fc01f) | 93 Bytes |
Additional details
Software
- Repository URL: https://github.com/antigenomics/tcr-covid-classifier
- Programming language: Python