Databases with aggregated information about immune receptor specificity provide a straightforward way to annotate your data and find condition-associated receptors. immunarch supports tools to annotate your data using the most popular AIRR databases: VDJDB, McPAS-TCR and PIRD TBAdb.
Database annotation is a two-step process. First, you need to download database files: either the full database files or filtered data obtained from the web interface of a database. After that, you can use immunarch functions to annotate your data and visualise the results. Below you can find a guide to annotation covering both steps.
VDJDB is a curated database of T-cell receptor sequences of known antigen specificity. The database is GitHub-based and available here: https://github.com/antigenomics/vdjdb-db
Citation: Shugay M et al. VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Research 2017
It can be useful to filter out immune receptors that are not relevant from the database before working with it. For instance, if you analyse human T-cell beta repertoires, there is no need to keep receptors from other species or non-TRB data. Use the VDJDB web interface at https://vdjdb.cdr3.net/search to filter the data. After setting the filters and pressing the “Refresh table” button, press the “Export” button and select “TSV”. This downloads the filtered database file with a name like “SearchTable-2019-10-17 12_36_11.989.tsv”, which can be used for annotation with immunarch.
You can use the previous method to download the full database if you set all checkmarks in the “General” section of the “CDR3” tab. However, if you want to download the raw database files, here is the step-by-step guide to the sophisticated process of VDJDB downloading and unpacking.
First, you need to install JDK 8 - Java Development Kit. If you already have it, skip this step. If you don’t, just search for the proper installation instructions for your system.
Second, you need to install Groovy - a language that is used for processing VDJDB. Go to this link and download the distribution or Windows installer depending on your system. For Windows users the best way is to download the Windows installer. For Linux users the easiest way is to use your OS package manager such as apt, dpkg, pacman, etc. For Mac users the most seamless way is to use Homebrew.
Download the VDJDB repository from GitHub via this link: https://github.com/antigenomics/vdjdb-db/archive/master.zip
Unzip the archive and go to the unpacked “vdjdb-db-master” folder.
Go to the “src” folder.
Open your Terminal or Console and execute the following command: groovy -cp . BuildDatabase.groovy --no2fix
After some processing, the database files will be available in the “database” folder inside the “vdjdb-db-master” folder. You will need to provide paths to these files to the immunarch annotation functions.
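The manual steps above can also be scripted from R. The following is a minimal sketch, not part of immunarch: build_vdjdb and dest_dir are hypothetical names, and it assumes that JDK 8 and Groovy are already installed and that the groovy command is on your PATH. The URL, folder names and build command are the ones given in the steps above.

```r
# Archive URL taken from the download step above
vdjdb_zip_url <- "https://github.com/antigenomics/vdjdb-db/archive/master.zip"

# Hypothetical helper: download, unpack and build VDJDB inside `dest_dir`
build_vdjdb <- function(dest_dir = tempdir()) {
  dest_dir <- path.expand(dest_dir)
  zip_path <- file.path(dest_dir, "vdjdb-db-master.zip")

  # Download the repository archive (binary mode keeps the zip intact on Windows)
  download.file(vdjdb_zip_url, zip_path, mode = "wb")

  # Unpack it; this creates the "vdjdb-db-master" folder
  unzip(zip_path, exdir = dest_dir)

  # Run the build script from inside the "src" folder, then restore the working directory
  old_wd <- setwd(file.path(dest_dir, "vdjdb-db-master", "src"))
  on.exit(setwd(old_wd), add = TRUE)
  status <- system2("groovy", c("-cp", ".", "BuildDatabase.groovy", "--no2fix"))
  if (status != 0) stop("Groovy build failed with exit status ", status)

  # Folder where the guide says the built database files appear
  file.path(dest_dir, "vdjdb-db-master", "database")
}

# Not run here because it needs network access, Java and Groovy:
# db_dir <- build_vdjdb("~/Downloads")
# list.files(db_dir)
```

The function returns the path of the “database” folder, so the files listed there can be passed on to the immunarch loading functions.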
McPAS-TCR is a manually curated catalogue of pathology associated T-cell receptor sequences. The database is available at http://friedmanlab.weizmann.ac.il/McPAS-TCR/
Citation: Tickotsky N, Sagiv T, Prilusky J, Shifrut E, Friedman N (2017). McPAS-TCR: A manually-curated catalogue of pathology-associated T cell receptor sequences. Bioinformatics 33:2924-2929
The filtering feature of the database’s web interface is located in the “Search Database” tab. After processing the data, press the “Download csv” button. The downloaded file named “McPAS-TCR_search.csv” can be used for annotation with immunarch.
To download McPAS-TCR you just need to go to http://friedmanlab.weizmann.ac.il/McPAS-TCR/ and press the “Download the complete database” button there. Note that sometimes you need to press it twice or press it in a new browser tab to start the downloading process.
TBAdb is a manually curated database of T-cell receptors (TCR) and B-cell receptors (BCR) targeting specific antigens or diseases. The database contains three parts, namely TCR-AB, TCR-GD and BCR. These three parts collect sequences and specificity information of TCRA and TCRB, TCR-gamma and TCR-delta, and BCR, respectively. The database is available at https://db.cngb.org/pird/tbadb/
Citation: Zhang W, Wang L, Liu K, Wei X, Yang K, Du W, Wang S, Guo N, Ma C, Luo L, et al. PIRD: Pan immune repertoire database. Bioinformatics (2019)
Currently there is no way to download the filtered data from TBAdb. The query functionality is available at https://db.cngb.org/pird/query/ in the “TBAdb” tab.
To download TBAdb you need to go to https://db.cngb.org/pird/tbadb/ and press the “Download TBAdb” button. Note that you should agree with the licensing agreement in order to download the database file.
After downloading the database, we can proceed to the annotation part with R. To demonstrate the applicability of R and immunarch, we will use a common task: annotation of repertoires with Cytomegalovirus (CMV) infection.
To start, we need to load the databases into R and filter out non-human, non-TRB and non-CMV data from the input database. With databases, we follow the same philosophy as with the repLoad and vis functions: the function dbLoad provides a single interface for loading and basic querying of all supported databases.
For demonstration purposes, we will process each of the supported databases below.
Download the VDJDB database following the instructions above. In the examples below, we use URLs to snippets of databases as file paths. In your own code you need to provide paths to your local database files, e.g., “/Users/yourname/Downloads/vdjdb-db-master/vdjdb.slim.txt”. Do not use the links below since they are only for testing purposes and not the actual databases!
Note that VDJDB data obtained from the web interface differs from VDJDB obtained from raw files. Check the next section for working with VDJDB search tables.
The most basic way to load VDJDB to R:
vdjdb = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/vdjdb.slim.txt.gz", "vdjdb")
vdjdb
## # A tibble: 61,049 x 19
## gene cdr3 species antigen.epitope antigen.gene antigen.species complex.id
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 TRB CASS… Macaca… STPESANL Tat SIV 0.
## 2 TRB CASS… HomoSa… RLRAEAQVK EBNA3A EBV 1.93e 4
## 3 TRB CASS… Macaca… TTPESANL Tat SIV 0.
## 4 TRA CASN… HomoSa… GILGFVFTL M InfluenzaA 0.
## 5 TRB CASS… MusMus… HGIRNASFI M45 MCMV 2.24e24
## 6 TRB CSAS… HomoSa… KLGGALQAK IE1 CMV 8.58e 3
## 7 TRA CAVL… HomoSa… GILGFVFTL M InfluenzaA 0.
## 8 TRB CASS… HomoSa… KLGGALQAK IE1 CMV 3.44e 3
## 9 TRB CAST… MusMus… SSYRRPVGI PB1 InfluenzaA 2.28e 4
## 10 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0.
## # … with 61,039 more rows, and 12 more variables: v.segm <chr>, j.segm <chr>,
## # v.end <dbl>, j.start <dbl>, mhc.a <chr>, mhc.b <chr>, mhc.class <chr>,
## # reference.id <chr>, vdjdb.score <dbl>, Species <chr>, Chain <chr>,
## # Pathology <chr>
To load VDJDB and filter out information you need to provide the .species, .chain and .pathology arguments:
vdjdb = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/vdjdb.slim.txt.gz", "vdjdb", .species = "HomoSapiens", .chain = "TRB", .pathology = "CMV")
vdjdb
## # A tibble: 18,039 x 19
## gene cdr3 species antigen.epitope antigen.gene antigen.species complex.id
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 TRB CSAS… HomoSa… KLGGALQAK IE1 CMV 8584
## 2 TRB CASS… HomoSa… KLGGALQAK IE1 CMV 3445
## 3 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## 4 TRB CASS… HomoSa… KLGGALQAK IE1 CMV 19396
## 5 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## 6 TRB CAST… HomoSa… KLGGALQAK IE1 CMV 10972
## 7 TRB CASS… HomoSa… KLGGALQAK IE1 CMV 6231
## 8 TRB CASS… HomoSa… KLGGALQAK IE1 CMV 12587
## 9 TRB CATS… HomoSa… KLGGALQAK IE1 CMV 13267
## 10 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## # … with 18,029 more rows, and 12 more variables: v.segm <chr>, j.segm <chr>,
## # v.end <dbl>, j.start <dbl>, mhc.a <chr>, mhc.b <chr>, mhc.class <chr>,
## # reference.id <chr>, vdjdb.score <dbl>, Species <chr>, Chain <chr>,
## # Pathology <chr>
vdjdb_st = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/SearchTable-2019-10-17%2012_36_11.989.tsv.gz", "vdjdb-search", .species = "HomoSapiens", .chain = "TRB", .pathology = "CMV")
vdjdb_st
## # A tibble: 4,999 x 19
## complex.id Gene CDR3 V J Species `MHC A` `MHC B` `MHC class`
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 0 TRB CASS… TRBV… TRBJ… HomoSa… HLA-A*… B2M MHCI
## 2 0 TRB CAWS… TRBV… TRBJ… HomoSa… HLA-A*… B2M MHCI
## 3 0 TRB CASS… TRBV… TRBJ… HomoSa… HLA-A*… B2M MHCI
## 4 0 TRB CASS… TRBV… TRBJ… HomoSa… HLA-B*… B2M MHCI
## 5 0 TRB CASS… TRBV… TRBJ… HomoSa… HLA-B*… B2M MHCI
## 6 0 TRB CASS… TRBV… TRBJ… HomoSa… HLA-B*… B2M MHCI
## 7 0 TRB CASV… TRBV… TRBJ… HomoSa… HLA-B*… B2M MHCI
## 8 0 TRB CASS… TRBV… TRBJ… HomoSa… HLA-B*… B2M MHCI
## 9 0 TRB CASS… TRBV… TRBJ… HomoSa… HLA-B*… B2M MHCI
## 10 0 TRB CASG… TRBV… TRBJ… HomoSa… HLA-B*… B2M MHCI
## # … with 4,989 more rows, and 10 more variables: Epitope <chr>, `Epitope
## # gene` <chr>, `Epitope species` <chr>, Reference <chr>, Method <chr>,
## # Meta <chr>, CDR3fix <chr>, Score <dbl>, Chain <chr>, Pathology <chr>
mcpas = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/McPAS-TCR.csv.gz", "mcpas", .species = "Human", .chain = "TRB", .pathology = "Cytomegalovirus (CMV)")
mcpas
## # A tibble: 2,723 x 29
## CDR3.alpha.aa CDR3.beta.aa Species Category Pathology Pathology.Mesh.…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> CASLAPGTTNE… Human Pathoge… Cytomega… D003586
## 2 <NA> CASLQAGANEQF Human Pathoge… Cytomega… D003586
## 3 <NA> CASLSGGGEQF Human Pathoge… Cytomega… D003586
## 4 <NA> CASLVASGQET… Human Pathoge… Cytomega… D003586
## 5 <NA> CASSHRDSGNT… Human Pathoge… Cytomega… D003586
## 6 <NA> CASSSANYGYTF Human Pathoge… Cytomega… D003586
## 7 <NA> CATSDPLTASY… Human Pathoge… Cytomega… D003586
## 8 CARNTGNQF CACSLRSQGTD… Human Pathoge… Cytomega… D003586
## 9 CAGNTGNQFYFG CASSAWDRSSG… Human Pathoge… Cytomega… D003586
## 10 CAYPYNNNDMRF CASSELGGAGT… Human Pathoge… Cytomega… D003586
## # … with 2,713 more rows, and 23 more variables:
## # Additional.study.details <chr>, Antigen.identification.method <dbl>,
## # NGS <chr>, Antigen.protein <chr>, Protein.ID <chr>, Epitope.peptide <chr>,
## # Epitope.ID <dbl>, MHC <chr>, Tissue <chr>, T.Cell.Type <chr>,
## # T.cell.characteristics <chr>, CDR3.alpha.nt <chr>, TRAV <chr>, TRAJ <chr>,
## # TRBV <chr>, TRBD <chr>, TRBJ <chr>, Reconstructed.J.annotation <chr>,
## # CDR3.beta.nt <chr>, Mouse.strain <chr>, PubMed.ID <dbl>, Remarks <chr>,
## # Chain <chr>
The key immunarch function for annotation is dbAnnotate. As input it requires the repertoires to search in, a database to look up, and the columns of interest, such as the columns with CDR3 amino acid sequences or V gene segment names. If you want to try it on the test data packaged with immunarch, execute the following line of code before proceeding further:
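The line of code itself is missing from this copy of the guide. Assuming the packaged test data is the immdata object that the dbAnnotate call on local_db operates on later in this guide, loading it would look like this:

```r
library(immunarch)

# Assumption: the bundled test dataset is the `immdata` object used in the
# dbAnnotate() examples of this guide
data(immdata)

# immdata$data is the list of repertoires that is passed to dbAnnotate()
names(immdata$data)
```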
In a single line of code you can find all clonotypes whose CDR3 amino acid sequences occur both in the input data and in the VDJDB database:
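The call that produced the output below is not shown in this copy. Judging by the dbAnnotate call on local_db later in this guide and by the column names CDR3.aa (repertoires) and cdr3 (VDJDB), it presumably matches those two columns. Here is a minimal sketch under that assumption, using a hypothetical one-column stand-in named vdjdb_demo so that it runs without the database file:

```r
library(immunarch)
data(immdata)

# Hypothetical stand-in for the loaded VDJDB table; only the `cdr3` column is
# needed for this search. The two sequences are taken from the output below.
vdjdb_demo <- dplyr::tibble(cdr3 = c("CASSLGETQYF", "CASSQETQYF"))

# Match the "CDR3.aa" column of each repertoire against the "cdr3" column of the database
found_vdjdb <- dbAnnotate(immdata$data, vdjdb_demo, "CDR3.aa", "cdr3")
found_vdjdb
```

With the real database, pass the vdjdb object returned by dbLoad instead of vdjdb_demo. The table below comes from such a full search.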
## CDR3.aa Samples A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192 MS1
## 1: CASSLGETQYF 11 6 4 2 9 6 0 1
## 2: CASSQETQYF 10 7 3 1 2 3 3 1
## 3: CASSFQETQYF 9 3 2 2 4 2 0 2
## 4: CASSLEGYEQYF 9 0 1 2 1 3 0 1
## 5: CASSSSYEQYF 9 1 0 0 1 2 2 1
## ---
## 635: CSVGTGTYEQYF 1 0 0 0 0 1 0 0
## 636: CSVQGGAYNEQFF 1 0 1 0 0 0 0 0
## 637: CSVQGGSYNEQFF 1 0 1 0 0 0 0 0
## 638: CSVVATNEKLFF 1 0 0 1 0 0 0 0
## 639: CSVVGTGNTEAFF 1 0 0 0 0 0 0 0
## MS2 MS3 MS4 MS5 MS6
## 1: 3 1 2 5 3
## 2: 1 0 0 5 1
## 3: 1 0 4 0 2
## 4: 0 1 1 1 1
## 5: 1 0 1 2 3
## ---
## 635: 0 0 0 0 0
## 636: 0 0 0 0 0
## 637: 0 0 0 0 0
## 638: 0 0 0 0 0
## 639: 0 0 0 1 0
The “Samples” column specifies the number of samples in which the clonotype was found. The numbers in the other columns are the clonal counts of the clonotype in each input sample.
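As a quick check of this reading, the sketch below recomputes the “Samples” value for the first three clonotypes of the CDR3-only VDJDB search from their printed per-sample counts. It only re-derives the number from the output shown above and does not describe how dbAnnotate computes it internally.

```r
# Per-sample clonal counts of the first three clonotypes, copied from the output above
counts <- rbind(
  CASSLGETQYF = c(6, 4, 2, 9, 6, 0, 1, 3, 1, 2, 5, 3),
  CASSQETQYF  = c(7, 3, 1, 2, 3, 3, 1, 1, 0, 0, 5, 1),
  CASSFQETQYF = c(3, 2, 2, 4, 2, 0, 2, 1, 0, 4, 0, 2)
)
colnames(counts) <- c("A2-i129", "A2-i131", "A2-i133", "A2-i132", "A4-i191", "A4-i192",
                      "MS1", "MS2", "MS3", "MS4", "MS5", "MS6")

# A clonotype counts as found in a sample when its clonal count there is non-zero
samples <- rowSums(counts > 0)
samples
```

The result is 11, 10 and 9, which agrees with the “Samples” column printed for these three clonotypes.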
In the next example we will search the McPAS-TCR database for condition-associated sequences using both CDR3 amino acid sequences and V gene segments:
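As with the VDJDB search, the call itself is not shown in this copy. Assuming the McPAS-TCR columns CDR3.beta.aa and TRBV listed in the loaded table are the ones matched against CDR3.aa and V.name, a minimal sketch looks as follows. The two-row table mcpas_demo is a hypothetical stand-in so that the snippet runs without the database file.

```r
library(immunarch)
data(immdata)

# Hypothetical stand-in for the loaded McPAS-TCR table; the two records are
# taken from the first rows of the output below
mcpas_demo <- dplyr::tibble(
  CDR3.beta.aa = c("CASSLAPGATNEKLFF", "CASSLGENIQYF"),
  TRBV = c("TRBV7-6", "TRBV13")
)

# Column pairs are matched by position: CDR3.aa with CDR3.beta.aa, V.name with TRBV
found_mcpas <- dbAnnotate(immdata$data, mcpas_demo,
                          c("CDR3.aa", "V.name"), c("CDR3.beta.aa", "TRBV"))
found_mcpas
```

With the real database, pass the mcpas object returned by dbLoad instead of mcpas_demo.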
## CDR3.aa V.name Samples A2-i129 A2-i131 A2-i133 A2-i132 A4-i191
## 1: CASSLAPGATNEKLFF TRBV7-6 4 0 0 0 0 0
## 2: CASSLGENIQYF TRBV13 4 0 0 0 0 1
## 3: CAISESYEQYF TRBV10-3 3 0 1 0 0 0
## 4: CASSLGRETQYF TRBV28 3 0 0 0 0 0
## 5: CSVGTGGTNEKLFF TRBV29-1 3 0 0 0 0 0
## ---
## 2123: KNPTAF TRBV19 1 0 0 0 0 0
## 2124: LLGGQETQYF TRBV7-4 1 0 0 0 0 0
## 2125: WASSFQGFTEAF TRBV28 1 0 0 0 0 0
## 2126: WASSQALPYEQYF TRBV12-4 1 0 0 0 0 0
## 2127: WASSQQTGTIGGYTF TRBV6-5 1 0 0 0 0 0
## A4-i192 MS1 MS2 MS3 MS4 MS5 MS6
## 1: 1 0 1 0 0 1 0
## 2: 1 0 0 0 0 1 0
## 3: 1 0 0 0 0 0 0
## 4: 0 0 0 0 0 89 1
## 5: 0 0 1 0 0 0 1
## ---
## 2123: 0 0 0 0 0 0 0
## 2124: 0 0 0 0 0 0 0
## 2125: 0 0 0 0 0 0 0
## 2126: 0 0 0 0 0 0 0
## 2127: 0 0 0 0 0 0 0
If you seek to search a database for a specific set of sequences, create a data frame with them and use it as a database file:
local_db = data.frame(Seq = c("CASSDSSGGANEQFF", "CSARLAGGQETQYF"), V = c("TRBV6-4", "TRBV20-1"), stringsAsFactors = FALSE)
dbAnnotate(immdata$data, local_db, c("CDR3.aa", "V.name"), c("Seq", "V"))
## CDR3.aa V.name Samples A2-i129 A2-i131 A2-i133 A2-i132 A4-i191
## 1: CASSDSSGGANEQFF TRBV6-4 7 1 1 2 0 5
## 2: CSARLAGGQETQYF TRBV20-1 7 1 3 0 2 1
## A4-i192 MS1 MS2 MS3 MS4 MS5 MS6
## 1: 0 0 0 2 0 0 15
## 2: 0 0 0 2 0 0 1
Visualisation with the vis() function will be supported in the next major release of immunarch. You can use ggplot2 to visualise distributions of found clonotypes.
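A plot of this kind can be put together by hand. Below is a minimal sketch, assuming the annotation result has a CDR3.aa column plus one clonal-count column per sample, as in the outputs above. The data frame wide re-types the two clonotypes from the local_db search, reduced to CDR3.aa and the sample columns, so that the snippet runs on its own. The names wide, long and p are illustrative.

```r
library(ggplot2)

# Two annotated clonotypes with per-sample clonal counts, copied from the local_db output
wide <- data.frame(
  CDR3.aa = c("CASSDSSGGANEQFF", "CSARLAGGQETQYF"),
  `A2-i129` = c(1, 1), `A2-i131` = c(1, 3), `A2-i133` = c(2, 0), `A2-i132` = c(0, 2),
  `A4-i191` = c(5, 1), `A4-i192` = c(0, 0), MS1 = c(0, 0), MS2 = c(0, 0),
  MS3 = c(2, 2), MS4 = c(0, 0), MS5 = c(0, 0), MS6 = c(15, 1),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Reshape to long format: one row per clonotype and sample
sample_cols <- setdiff(names(wide), "CDR3.aa")
long <- data.frame(
  CDR3.aa = rep(wide$CDR3.aa, times = length(sample_cols)),
  stack(wide[sample_cols]),
  stringsAsFactors = FALSE
)
names(long) <- c("CDR3.aa", "Count", "Sample")

# Grouped bar chart of clonal counts per sample
p <- ggplot(long, aes(x = Sample, y = Count, fill = CDR3.aa)) +
  geom_col(position = "dodge") +
  labs(x = "Sample", y = "Clonal count", fill = "Clonotype") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```

For a real result, replace wide with the table returned by dbAnnotate and drop any columns that are not per-sample counts, such as “Samples” or V.name, before reshaping.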
immunarch provides a very basic query interface that permits filtering by species, chain and pathology only. To perform advanced filtering, such as filtering by antigen epitope, you need to use R. In most cases, filtering with the dplyr package is the most seamless way. Here is an example of how to use dplyr to keep only a specific antigen epitope from VDJDB:
# Load the dplyr library
library(dplyr)
# Load the database with immunarch
vdjdb = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/vdjdb.slim.txt.gz", "vdjdb", .species = "HomoSapiens", .chain = "TRB", .pathology = "CMV")
# Check which antigen epitopes are present in the database
table(vdjdb$antigen.epitope)
##
## ARNLVPMVATVQGQN AYAQKIFKI CPSQEPMSIYVY CVETMCNEY DEEDAIAAY
## 3 39 2 2 2
## EDVPSGKLFMHVTLG EFFWDANDIY ELKRKMIYM ELRRKMMYM FPTKDVAL
## 1 1 5 10 10
## IPSINVHHY KLGGALQAK LSEFCRVLCCYVLEE MLNIPSINV NEGVKAAW
## 93 12667 2 73 49
## NLVPMVATV QIKVRVDMV QIKVRVKMV QYDPVAALF RPHERNGFTV
## 4496 15 24 39 4
## RPHERNGFTVL TPRVTGGGAM VLEETSVML VMAPRTLIL VTEHDTLLY
## 22 207 14 1 202
## YILEETSVM YSEHPTFTSQY
## 3 53
# Filter out all non NLVPMVATV epitopes
vdjdb = vdjdb %>% filter(antigen.epitope == "NLVPMVATV")
vdjdb
## # A tibble: 4,496 x 19
## gene cdr3 species antigen.epitope antigen.gene antigen.species complex.id
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## 2 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## 3 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## 4 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## 5 TRB CSAD… HomoSa… NLVPMVATV pp65 CMV 0
## 6 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## 7 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## 8 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## 9 TRB CSVE… HomoSa… NLVPMVATV pp65 CMV 0
## 10 TRB CASS… HomoSa… NLVPMVATV pp65 CMV 0
## # … with 4,486 more rows, and 12 more variables: v.segm <chr>, j.segm <chr>,
## # v.end <dbl>, j.start <dbl>, mhc.a <chr>, mhc.b <chr>, mhc.class <chr>,
## # reference.id <chr>, vdjdb.score <dbl>, Species <chr>, Chain <chr>,
## # Pathology <chr>
table(vdjdb$antigen.epitope)
##
## NLVPMVATV
## 4496
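Other columns can be combined in the same filter call. The following is a minimal sketch on a hand-made table: the rows of vdjdb_demo are invented for illustration and only reuse the column names antigen.epitope and vdjdb.score from the VDJDB table above. It keeps records for two epitopes with a score of at least 1; the threshold is arbitrary.

```r
library(dplyr)

# Invented rows for illustration; only the column names mirror the VDJDB table
vdjdb_demo <- tibble(
  cdr3 = c("toy1", "toy2", "toy3", "toy4"),
  antigen.epitope = c("NLVPMVATV", "KLGGALQAK", "TPRVTGGGAM", "NLVPMVATV"),
  vdjdb.score = c(0, 3, 1, 2)
)

# Keep two epitopes of interest and drop records with a score of 0
selected <- vdjdb_demo %>%
  filter(antigen.epitope %in% c("NLVPMVATV", "TPRVTGGGAM"), vdjdb.score >= 1)
selected
```

Conditions separated by commas inside filter are combined with a logical AND, so only the “toy3” and “toy4” rows remain.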
Cannot find an important feature? Have a question or found a bug? Contact us at support@immunomind.io