Repertoire overlap is the most common approach to measure repertoire similarity. It is achieved by computation of specific statistics on clonotypes shared between given repertoires, also called “public” clonotypes. immunarch
provides several indices: - number of public clonotypes (.method = "public"
) - a classic measure of overlap similarity.
overlap coefficient (.method = "overlap"
) - a normalised measure of overlap similarity. It is defined as the size of the intersection divided by the smaller of the size of the two sets.
Jaccard index (.method = "jaccard"
) - it measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
Tversky index (.method = "tversky"
) - an asymmetric similarity measure on sets that compares a variant to a prototype. If using default arguments, it’s similar to Dice’s coefficient.
cosine similarity (.method = "cosine"
) - a measure of similarity between two non-zero vectors
Morisita’s overlap index (.method = "morisita"
) - a statistical measure of dispersion of individuals in a population. It is used to compare overlap among samples.
incremental overlap - overlaps of the N most abundant clonotypes with incrementally growing N (.method = "inc+METHOD"
, e.g., "inc+public"
or "inc+morisita"
).
The function that includes described methods is repOverlap
. Again the output is easily visualised when passed to vis()
function that does all the work:
imm_ov1 = repOverlap(immdata$data, .method = "public", .verbose = F)
imm_ov2 = repOverlap(immdata$data, .method = "morisita", .verbose = F)
grid.arrange(vis(imm_ov1), vis(imm_ov2, .text.size=2), ncol = 2)
vis(imm_ov1, "heatmap2")
You can easily change the number of significant digits:
grid.arrange(vis(imm_ov2, .text.size=2.5, .signif.digits=1), vis(imm_ov2, .text.size=2, .signif.digits=2), ncol = 2)
To analyse the computed overlap measures function apply repOverlapAnalysis
.
# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
## Standard deviations (1, .., p=4):
## [1] 0 0 0 0
##
## Rotation (n x k) = (12 x 2):
## [,1] [,2]
## A2-i129 -20.2308709 22.431389
## A2-i131 8.3055445 -26.779321
## A2-i133 45.9341813 -6.893304
## A2-i132 -55.0903957 -18.572513
## A4-i191 23.7461189 -1.118162
## A4-i192 -4.4041243 38.028858
## MS1 -19.5494165 -12.836320
## MS2 -1.9063188 -6.075283
## MS3 -9.8321059 11.217724
## MS4 0.9127103 1.154627
## MS5 5.6552254 -27.415676
## MS6 26.4594518 26.857981
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
## DimI DimII
## A2-i129 -309.308638 251.22417
## A2-i131 551.123363 -101.15299
## A2-i133 62.186836 -100.85813
## A2-i132 -59.468871 -175.06698
## A4-i191 -26.853840 -92.15216
## A4-i192 -330.025043 335.14260
## MS1 -235.802997 299.91630
## MS2 27.022606 -182.46162
## MS3 -295.471354 285.13559
## MS4 7.473276 -148.13049
## MS5 555.850331 -135.31223
## MS6 53.274330 -236.28406
## attr(,"class")
## [1] "matrix" "immunr_tsne"
# Visualise the results
vis(repOverlapAnalysis(imm_ov1, "mds"))
# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
## Standard deviations (1, .., p=4):
## [1] 0 0 0 0
##
## Rotation (n x k) = (12 x 2):
## [,1] [,2]
## A2-i129 -20.2308709 22.431389
## A2-i131 8.3055445 -26.779321
## A2-i133 45.9341813 -6.893304
## A2-i132 -55.0903957 -18.572513
## A4-i191 23.7461189 -1.118162
## A4-i192 -4.4041243 38.028858
## MS1 -19.5494165 -12.836320
## MS2 -1.9063188 -6.075283
## MS3 -9.8321059 11.217724
## MS4 0.9127103 1.154627
## MS5 5.6552254 -27.415676
## MS6 26.4594518 26.857981
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
## DimI DimII
## A2-i129 -146.02757 -114.517600
## A2-i131 50.42299 264.605027
## A2-i133 90.16784 -12.674060
## A2-i132 86.90945 -26.117080
## A4-i191 73.07017 -25.917212
## A4-i192 -135.88332 -105.975243
## MS1 -150.84908 -102.155023
## MS2 76.93567 -13.249442
## MS3 -144.63938 -109.063342
## MS4 80.64141 -18.029681
## MS5 47.83455 269.211192
## MS6 71.41727 -6.117536
## attr(,"class")
## [1] "matrix" "immunr_tsne"
# Visualise the results
vis(repOverlapAnalysis(imm_ov1, "mds"))
# Clusterise the MDS resulting components using K-means
vis(repOverlapAnalysis(imm_ov1, "mds+kmeans"))
In order to build a massive table with all clonotypes from the list of repertoires use the pubRep
function.
# Pass "nt" as the second parameter to build the public repertoire table using CDR3 nucleotide sequences
pr.nt = pubRep(immdata$data, "nt", .verbose = F)
## Warning in max(res[["Samples"]]): no non-missing arguments to max;
## returning -Inf
pr.nt
## Source: local data table [0 x 14]
##
## # A tibble: 0 x 14
## # … with 14 variables: CDR3.nt <chr>, Samples <dbl>, `A2-i129` <dbl>,
## # `A2-i131` <dbl>, `A2-i133` <dbl>, `A2-i132` <dbl>, `A4-i191` <dbl>,
## # `A4-i192` <dbl>, MS1 <dbl>, MS2 <dbl>, MS3 <dbl>, MS4 <dbl>,
## # MS5 <dbl>, MS6 <dbl>
# Pass "aa+v" as the second parameter to build the public repertoire table using CDR3 aminoacid sequences and V alleles
# In order to use only CDR3 aminoacid sequences, just pass "aa"
pr.aav = pubRep(immdata$data, "aa+v", .verbose = F)
pr.aav
## Source: local data table [86,185 x 15]
##
## # A tibble: 86,185 x 15
## CDR3.aa V.name Samples `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CASSLE… TRBV5… 8 2 NA 3 1 NA
## 2 CASSDS… TRBV6… 6 1 1 2 NA 5
## 3 CASSDS… TRBV6… 6 NA NA NA 4 1
## 4 CASSFQ… TRBV5… 6 3 NA 1 1 NA
## 5 CASSLG… TRBV1… 6 2 NA NA 4 3
## 6 CASSLQ… TRBV1… 6 NA 1 1 4 NA
## 7 CSARLA… TRBV2… 6 1 3 NA 2 1
## 8 CASSDS… TRBV6… 5 NA NA NA 3 NA
## 9 CASSFG… TRBV1… 5 NA 1 NA 5 NA
## 10 CASSLD… TRBV5… 5 1 NA NA NA NA
## # … with 86,175 more rows, and 7 more variables: `A4-i192` <dbl>,
## # MS1 <dbl>, MS2 <dbl>, MS3 <dbl>, MS4 <dbl>, MS5 <dbl>, MS6 <dbl>
# You can also pass the ".coding" parameter to filter out all noncoding sequences first:
pr.aav.cod = pubRep(immdata$data, "aa+v", .coding=T)
# Create a public repertoire with coding-only sequences using both CDR3 amino acid sequences and V genes
pr = pubRep(immdata$data, "aa+v", .coding = T, .verbose = F)
# Apply the filter subroutine to leave clonotypes presented only in healthy individuals
pr1 = pubRepFilter(pr, immdata$meta, c(Status = "C"))
# Apply the filter subroutine to leave clonotypes presented only in diseased individuals
pr2 = pubRepFilter(pr, immdata$meta, c(Status = "MS"))
# Divide one by another
pr3 = pubRepApply(pr1, pr2)
# Plot it
p = ggplot() + geom_jitter(aes(x = "Treatment", y = Result), data=pr3)
p