introduction_to_clonealign.Rmd
clonealign
assigns cells measured using single-cell RNA-seq to their clones of origin using copy number data. This is especially useful when clones are inferred from shallow single-cell DNA-seq, in which case the copy number state of each clone is known, but the precise SNV structure is unknown.
To assign cells to clones, clonealign
makes the assumption that
\[ \text{gene expression} \propto \text{number of gene copies} \]
This is demonstrated in the figure below.
Mathematically we have an \(N \times G\) matrix \(Y\) of raw gene expression counts (from RNA-seq) for \(N\) cells and \(G\) genes, where \(y_{ng}\) is the counts to gene \(g\) in cell \(c\). We also have a \(G \times C\) matrix \(\Lambda\) of copy number variation for \(C\) clones, where \(\lambda_{gc}\) is the copy number of gene \(g\) in clone \(c\). We introduce a clone-assigning categorical variable \(\pi_n\) for each cell, where
\[z_n = c \text{ if cell $n$ on clone $c$} \]
then clonealign
models the conditional expected counts in a gene and cell as
\[ E[y_{ng} | z_n=c] = \frac{\lambda_{g,c} f(\mu_g) e^{\psi_n \cdot w_g}}{ \sum_{g'}\lambda_{g',c} f(\mu_{g'}) e^{\psi_n \cdot w_{g'}}} s_n \]
where \(s_n\) is a cell-specific size factor, \(\mu_g\) is the per-chromosome expression (normalized so that \(\mu_1 = 1\) for model identifiability), \(f\) is a function that maps copy number to a multiplicative factor of expression, and \(\psi\) and \(w\) are cell and gene specific random effects. The noise distribution is assumed to be negative-binomial. Inference is performed using reparametrization-gradient variational inference to calculate a posterior distribution of the clone assignments \(z_n\) and of all other model parameters.
clonealign
is built upon Google’s Tensorflow using the Tensorflow R package provided by Rstudio. To install tensorflow
, run
install.packages("tensorflow")
library(tensorflow)
install_tensorflow()
You can confirm the installation succeeded by running
sess = tf$Session()
hello <- tf$constant('Hello, TensorFlow!')
sess$run(hello)
For more details see the Rstudio page on tensorflow installation.
clonealign
can then be installed using the devtools
package via
devtools::install_github("kieranrcampbell/clonealign")
By default, clonealign
requires two inputs:
SingleCellExperiment
, SummarizedExperiment
or cell by gene matrix
data.frame
, DataFrame
or matrix
Bundled with the package is an example SingleCellExperiment
for 100 genes and 200 cells:
library(clonealign)
data(example_sce)
example_sce
#> class: SingleCellExperiment
#> dim: 100 200
#> metadata(0):
#> assays(1): counts
#> rownames(100): gene_1 gene_2 ... gene_99 gene_100
#> rowData names(3): A B C
#> colnames(200): cell_1 cell_2 ... cell_199 cell_200
#> colData names(0):
#> reducedDimNames(0):
#> spikeNames(0):
This has raw integer counts in the assays
slot as required for input to clonealign
:
assay(example_sce, "counts")[1:5, 1:5]
#> cell_1 cell_2 cell_3 cell_4 cell_5
#> gene_1 0 0 0 0 0
#> gene_2 0 0 0 0 0
#> gene_3 1 1 1 0 0
#> gene_4 0 0 0 0 0
#> gene_5 1 0 0 1 0
The CNV data is stored in the rowData
of the SingleCellExperiment
for 3 clones (A, B, and C) and crucially the same genes as the expression data:
cnv_data <- rowData(example_sce)[, c("A", "B", "C")]
stopifnot(nrow(cnv_data) == nrow(example_sce)) # Make sure genes match up
head(cnv_data)
#> DataFrame with 6 rows and 3 columns
#> A B C
#> <integer> <integer> <integer>
#> 1 1 2 2
#> 2 2 1 1
#> 3 3 2 2
#> 4 3 2 2
#> 5 2 2 3
#> 6 2 1 1
The model is fitted with a basic call to clonealign
, which prints the ELBO for each iteration (this can be turned off by setting verbose = FALSE
):
cal <- clonealign(example_sce, cnv_data)
#> Removing 0 genes with low counts
#> Creating Tensorflow graph...
#> clonealign inference complete
print(cal)
#> A clonealign_fit for 200 cells, 100 genes, and 3 clones
#> To access clone assignments, call x$clone
#> To access ML parameter estimates, call x$ml_params
We can plot the ELBO to ensure convergence:
qplot(seq_along(cal$elbo), cal$elbo, geom = c("point", "line")) +
labs(x = "Iteration", y = "ELBO")
The maximum likelihood estimates of the clone assignments can be access through the clone
slot:
clones <- cal$clone
table(clones)
#> clones
#> A B C
#> 184 7 9
This can easily be added to the SingleCellExperiment
for visualization with scater
:
example_sce$clone <- clones
example_sce <- normalise(example_sce)
#> Warning in .local(object, ...): using library sizes as size factors
plotPCA(example_sce, ncomponents = 3, colour_by = "clone")
The clone assignments in clones
can then be used for the desired downstream analysis, such as differential expression or SNV analysis.
The plot_clonealign
function can be used to check the sanity of the fitted clones by ensuring that gene expression does correlate to the inferred copy number. For this we require the input SingleCellExperiment
, copy number matrix, and clone assignments. Note that the SingleCellExperiment
requires columns in rowData
corresponding to the chromosome, start and end position of each feature (gene). These can conveniently be gathered using the getBMFeatureAnnos
function in scater
, e.g.
sce <- getBMFeatureAnnos(sce, filters = "ensembl_gene_id",
attributes = c("ensembl_gene_id", "start_position", "end_position"),
feature_symbol = "hgnc_symbol",
feature_id = "ensembl_gene_id",
dataset = "hsapiens_gene_ensembl")
For now we’ll set these to made-up values:
gene_position <- as_data_frame(cnv_data) %>%
mutate(gene = seq_len(nrow(cnv_data))) %>%
arrange(A, B, C) %>%
mutate(position = seq_len(nrow(cnv_data))) %>%
arrange(gene) %>%
.$position
#> Warning in as.data.frame(x, row.names = NULL, optional = optional, ...):
#> Arguments in '...' ignored
rowData(example_sce)$chromosome <- "1"
rowData(example_sce)$start_pos <- gene_position
rowData(example_sce)$end_pos <- gene_position
We can then plot the expression estimates using the plot_clonealign
function:
plot_clonealign(example_sce, cal$clone, cnv_data,
chromosome = "1",
chr_str = "chromosome",
start_str = "start_pos",
end_str = "end_pos")
#> Joining, by = "ensembl_gene_id"
#> Joining, by = c("clone", "state")
#> Joining, by = c("state", "clone")
where the *_str
identifies the columns of rowData(example_sce)
to look for the chromosome names and feature start and end positions.
clonealign
will assign single-cell RNA-seq to clones no matter how good the fit and no matter whether the clones actually exist in the expression data. As a consequence, it is important to evaluate the quality of fit which in practice we do by fitting the clones on a subset of genes and looking at how well the predicted expression profiles fits the observed expression data of the “held-out” genes.
We perform his out-of-sample evaluation using the evaluate_clonealign
function, which takes in the clonealign fit on the full dataset (along with the dataset) and both prints various metrics and returns a list containing various computed objects:
ec <- evaluate_clonealign(example_sce, cnv_data, cal)
#> On the full dataset, the observed MSE was on average 1.01506017179589 times smaller than under a null model and smaller 100% of the time
#> Fitting reduced clonealign model...
#> Agreement between original (rows) and reduced (columns):
#> A B C
#> A 105 0 0
#> B 78 7 4
#> C 1 0 5
#> On the held-out dataset, the observed MSE was on average 1.02471611655892 times smaller than under a null model and smaller 100% of the time
The function call prints several important metrics:
These metrics are stored in a list returned by the function call. For details, see the help page at ?evaluate_clonealign
.
Inference is performed using reparametrization gradient variational inference which uses the evidence lower bound (ELBO) to monitor convergence. This is controlled using the rel_tol
parameter. When the difference
\[ \Delta ELBO = \frac{ELBO_{\text{new}} - ELBO_{\text{old}}}{|ELBO_{\text{old}}|} \] falls below rel_tol
, the optimization algorithm is considered converged. The maximum number of iterations to acheive this is set using the max_iter
parameter.
Inference is performed using Adam (Kingma and Ba (2014)) which is controlled by three parameters:
learning_rate
the learning raterel_tol
the relative difference in the ELBO below which the optimization will be considered converged (see above)max_iter
the maximum number of Adam iterations to perform (see above)The object returned by a call to clonealign
contains a clone
slot for the maximum likelihood (ML) clone assignment for each cell. The ML estimates of the other parameters can be found in the cal$ml_params
slot:
names(cal$ml_params)
#> [1] "clone_probs" "mu" "s" "alpha" "a"
#> [6] "b" "psi" "W" "chi"
The slot clone_probs
gives the probability that each cell is assigned to each clone:
head(cal$ml_params$clone_probs)
#> A B C
#> [1,] 0.97769814 0.020225340 0.002076519
#> [2,] 0.02084794 0.480126307 0.499025749
#> [3,] 0.98862077 0.007752682 0.003626547
#> [4,] 0.95802716 0.039739091 0.002233748
#> [5,] 0.05501944 0.253582424 0.691398136
#> [6,] 0.95206569 0.044654717 0.003279597
while mu
and phi
give the maximum likelihood estimates of the \(\mu\) and \(\phi\) parameters from the model.
sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 9 (stretch)
#>
#> Matrix products: default
#> BLAS: /usr/lib/openblas-base/libblas.so.3
#> LAPACK: /usr/lib/libopenblasp-r0.2.19.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] bindrcpp_0.2.2 clonealign_0.99.0
#> [3] dplyr_0.7.5 scater_1.8.0
#> [5] ggplot2_3.0.0 SingleCellExperiment_1.2.0
#> [7] SummarizedExperiment_1.10.1 DelayedArray_0.6.1
#> [9] BiocParallel_1.14.1 matrixStats_0.53.1
#> [11] Biobase_2.40.0 GenomicRanges_1.32.3
#> [13] GenomeInfoDb_1.16.0 IRanges_2.14.10
#> [15] S4Vectors_0.18.3 BiocGenerics_0.26.0
#> [17] BiocStyle_2.8.2
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-6 fs_1.2.2
#> [3] RColorBrewer_1.1-2 progress_1.2.0
#> [5] rprojroot_1.3-2 tools_3.5.0
#> [7] backports_1.1.2 R6_2.2.2
#> [9] vipor_0.4.5 lazyeval_0.2.1
#> [11] colorspace_1.3-2 withr_2.1.2
#> [13] prettyunits_1.0.2 tidyselect_0.2.4
#> [15] gridExtra_2.3 compiler_3.5.0
#> [17] xml2_1.2.0 desc_1.2.0
#> [19] labeling_0.3 bookdown_0.7
#> [21] scales_1.0.0 tfruns_1.4
#> [23] pkgdown_1.0.0 commonmark_1.5
#> [25] stringr_1.3.1 digest_0.6.16
#> [27] rmarkdown_1.10 XVector_0.20.0
#> [29] base64enc_0.1-3 pkgconfig_2.0.1
#> [31] htmltools_0.3.6 limma_3.36.2
#> [33] rlang_0.2.2 rstudioapi_0.7
#> [35] shiny_1.1.0 DelayedMatrixStats_1.2.0
#> [37] bindr_0.1.1 jsonlite_1.5
#> [39] tensorflow_1.5 RCurl_1.95-4.10
#> [41] magrittr_1.5 GenomeInfoDbData_1.1.0
#> [43] Matrix_1.2-14 Rcpp_0.12.19
#> [45] ggbeeswarm_0.6.0 munsell_0.5.0
#> [47] Rhdf5lib_1.2.1 reticulate_1.10
#> [49] viridis_0.5.1 whisker_0.3-2
#> [51] stringi_1.2.2 yaml_2.1.19
#> [53] edgeR_3.22.3 MASS_7.3-50
#> [55] zlibbioc_1.26.0 rhdf5_2.24.0
#> [57] plyr_1.8.4 grid_3.5.0
#> [59] promises_1.0.1 shinydashboard_0.7.0
#> [61] crayon_1.3.4 lattice_0.20-35
#> [63] cowplot_0.9.2 hms_0.4.2
#> [65] locfit_1.5-9.1 knitr_1.20
#> [67] pillar_1.2.3 rjson_0.2.20
#> [69] reshape2_1.4.3 glue_1.2.0
#> [71] evaluate_0.10.1 data.table_1.11.4
#> [73] httpuv_1.4.3 tidyr_0.8.1
#> [75] gtable_0.2.0 purrr_0.2.5
#> [77] assertthat_0.2.0 xfun_0.2
#> [79] mime_0.5 xtable_1.8-2
#> [81] roxygen2_6.0.1 later_0.7.3
#> [83] viridisLite_0.3.0 tibble_1.4.2
#> [85] beeswarm_0.2.3 memoise_1.1.0
#> [87] tximport_1.8.0