NOTE: The old version of this tutorial has been archived. The users can access it from here.

This guide will demonstrate the usage of the Liger package in the style of the R Console, which can be accessed through an R development environment (e.g., RStudio) or directly from the R command line.

Stage I: Preprocessing and Normalization (3 - 5 seconds)

1. For the first portion of this protocol, we will be integrating data from control and interferon-stimulated PBMCs from Kang et al, 2017. The data can be found in the Gene Expression Omnibus, Series GSE96583. This dataset was originally in the form of output from the 10X Cellranger pipeline, though we will directly load downsampled versions of the control and stimulated DGEs here.

For convenience, we have prepared the pre-processed data which are ready to use. There are three datasets: “PBMC_control.RDS” and “PBMC_interferon-stimulated.RDS”, which correspond to control and interferon-stimulated PBMCs individually. The data can be downloaded here.

library(liger)
ctrl_dge <- readRDS("~/Downloads/PBMC_control.RDS");
stim_dge <- readRDS("~/Downloads/PBMC_interferon-stimulated.RDS");

For 10X CellRanger output, we can instead use the read10X function, which generates a matrix or list of matrices directly from the output directories.

matrix_list <- read10X(sample.dirs =c("10x_ctrl_outs", "10x_stim_outs"), sample.names = c("ctrl", "stim"), merge = F);

2. With the digital gene expression matrices for both datasets, we can initialize a Liger object using the createLiger function.

ifnb_liger <- createLiger(list(ctrl = ctrl_dge, stim = stim_dge))

ifnb_liger now contains two datasets in its raw.data slot, ctrl and stim. We can run the rest of the analysis on this Liger object.

3. Before we can run iNMF on our datasets, we must run several preprocessing steps to normalize expression data to account for differences in sequencing depth and efficiency between cells, identify variably expressed genes, and scale the data so that each gene has the same variance. Note that because nonnegative matrix factorization requires positive values, we do not center the data by subtracting the mean. We also do not log transform the data.

ifnb_liger <- normalize(ifnb_liger)
ifnb_liger <- selectGenes(ifnb_liger)
ifnb_liger <- scaleNotCenter(ifnb_liger)

4. We are now able to run integrative non-negative matrix factorization on the normalized and scaled datasets. The key parameter for this analysis is k, the number of matrix factors (analogous to the number of principal components in PCA). In general, we find that a value of k between 20 and 40 is suitable for most analyses and that results are robust for choice of k. Because LIGER is an unsupervised, exploratory approach, there is no single “right” value for k, and in practice, users choose k from a combination of biological prior knowledge and other information.

Stage II: Joint Matrix Factorization (3 - 10 minutes)

ifnb_liger <- optimizeALS(ifnb_liger, k = 20)

Important parameters are as follows:

The optimization yields several lower dimension matrices, including the H matrix of metagene loadings for each cell, the W matrix of shared factor loadings and the V matrices of dataset-specific factor loadings.

Please note that the time required of this step is highly dependent on the size of the datasets being used. In most cases, this step should not take much longer than 30 minutes.

5. We can now use the resulting factors to jointly cluster cells and perform quantile normalization by dataset, factor, and cluster to fully integrate the datasets. All of this functionality is encapsulated within the quantile_norm function, which uses max factor assignment followed by refinement using a k-nearest neighbors graph.

Stage III: Quantile Normalization and Joint Clustering (1 minute)

ifnb_liger <- quantile_norm(ifnb_liger)

Important parameters of quantile_norm are as follows:

6. The quantile_norm procedure produces joint clustering assignments and a low-dimensional representation that integrates the datasets together. These joint clusters directly from iNMF can be used for downstream analyses (see below). Alternatively, you can also run Louvain community detection, an algorithm commonly used for single-cell data, on the normalized cell factors. The Louvain algorithm excels at merging small clusters into broad cell classes and thus may be more desirable in some cases than the maximum factor assignments produced directly by iNMF.

ifnb_liger <- louvainCluster(ifnb_liger, resolution = 0.25)

7. To visualize the clustering of cells graphically, we can project the normalized cell factors to two or three dimensions. Liger supports both t-SNE and UMAP for this purpose. Note that if both techniques are run, the object will only hold the results from the most recent.

Stage IV: Visualization (2 - 3 minutes) and Downstream Analysis (25 - 40 seconds)

ifnb_liger <- runUMAP(ifnb_liger, distance = 'cosine', n_neighbors = 30, min_dist = 0.3)

The liger package implements a variety of utilities for visualization and analysis of clustering, gene expression across datasets, and comparisons of cluster assignments. We will summarize several here.

8. plotByDatasetAndCluster returns two graphs, generated by t-SNE or UMAP in the previous step. The first colors cells by dataset of origin, and the second by cluster as determined by Liger. The plots provide visual confirmation that the datasets are well aligned and the clusters are consistent with the shape of the data as revealed by UMAP.

all.plots <- plotByDatasetAndCluster(ifnb_liger, axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T)
all.plots[[1]] + all.plots[[2]]

To directly study the impact of factors on the clustering and determine what genes load most highly on each factor, we use the plotGeneLoadings function, which returns plots of factor loading on the dimensionally reduced graphs and highly loaded genes by dataset for each factor.

gene_loadings <- plotGeneLoadings(ifnb_liger, do.spec.plot = FALSE, return.plots = TRUE)
gene_loadings[[4]]

Using the runWilcoxon function, we can next identify gene markers for all clusters. We can also compare expression within each cluster across datasets, which in this case reveals markers of interferon-beta stimulation. The function returns a table of data that allows us to determine the significance of each gene’s differential expression, including log fold change, area under the curve and p-value.

cluster.results <- runWilcoxon(ifnb_liger, compare.method = "clusters")
## [1] "Performing Wilcoxon test on ALL datasets: ctrl, stim"
head(cluster.results)
##         feature group   avgExpr        logFC statistic       auc        pval
## 1 RP11-206L10.2     0 -23.01898 -0.005557170   4259604 0.4998057 0.570575909
## 2 RP11-206L10.9     0 -23.01914 -0.005979592   4259602 0.4998055 0.570110484
## 3     LINC00115     0 -23.00609 -0.031260202   4252898 0.4990189 0.138680541
## 4         NOC2L     0 -21.83167  0.292228474   4340248 0.5092682 0.004585054
## 5        KLHL17     0 -23.00556  0.003658342   4262141 0.5001034 0.819517383
## 6       PLEKHN1     0 -23.01194 -0.039663835   4249915 0.4986689 0.044537037
##         padj pct_in pct_out
## 1 0.69419204    100     100
## 2 0.69419204    100     100
## 3 0.24813295    100     100
## 4 0.01435755    100     100
## 5 0.88363326    100     100
## 6 0.10099905    100     100
datasets.results <- runWilcoxon(ifnb_liger, compare.method = "datasets")
## [1] "Performing Wilcoxon test on ALL datasets: ctrl, stim"
head(datasets.results)
##         feature  group   avgExpr       logFC statistic       auc       pval
## 1 RP11-206L10.2 0-ctrl -23.02585 -0.01408078  665437.5 0.4995560 0.30577296
## 2 RP11-206L10.9 0-ctrl -23.02585 -0.01376717  665437.5 0.4995560 0.30577296
## 3     LINC00115 0-ctrl -22.99983  0.01283157  666564.5 0.5004020 0.59231247
## 4         NOC2L 0-ctrl -21.87312 -0.08498766  662383.5 0.4972633 0.61893894
## 5        KLHL17 0-ctrl -22.98625  0.03960509  667718.0 0.5012680 0.09102028
## 6       PLEKHN1 0-ctrl -23.02585 -0.02853147  664846.0 0.4991119 0.14726281
##        padj pct_in pct_out
## 1 0.6990712    100     100
## 2 0.6990712    100     100
## 3 0.8964793    100     100
## 4 0.9110340    100     100
## 5 0.4580475    100     100
## 6 0.5773302    100     100

The number of significant genes identified by runWilcoxon varies and depends on the datasets used. You can then filter the markers which are statistically and biologically significant. For example, one strategy is to filter the output by taking markers which have padj (Benjamini-Hochberg adjusted p-value) less than 0.05 and logFC (log fold change between observations in group versus out) larger than 3:

cluster.results <- cluster.results[cluster.results$padj < 0.05,]
cluster.results <- cluster.results[cluster.results$logFC > 3,]

You can then re-sort the markers by its padj value in ascending order and choose the top 100 for each cell type. For example, we can subset and re-sort the output for Cluster 3 and take the top 20 markers by typing these commands:

wilcoxon.cluster_3 <- cluster.results[cluster.results$group == 3, ]
wilcoxon.cluster_3 <- wilcoxon.cluster_3[order(wilcoxon.cluster_3$padj), ]
markers <- wilcoxon.cluster_3[1:20, ]
head(markers)
##       feature group    avgExpr     logFC statistic       auc          pval
## 41861    GNLY     3  -5.282466 16.147323   2379942 0.9647642  0.000000e+00
## 46904   CLIC3     3 -12.962070  9.485599   1946458 0.7890417  0.000000e+00
## 47130    PRF1     3 -12.260573  9.830082   1970124 0.7986352  0.000000e+00
## 49239    GZMB     3  -7.840488 13.697178   2231966 0.9047789  0.000000e+00
## 52832    NKG7     3  -6.594620 14.433185   2324832 0.9424243  0.000000e+00
## 48310   KLRC1     3 -18.000059  4.916239   1606731 0.6513252 1.839405e-289
##                padj pct_in pct_out
## 41861  0.000000e+00    100     100
## 46904  0.000000e+00    100     100
## 47130  0.000000e+00    100     100
## 49239  0.000000e+00    100     100
## 52832  0.000000e+00    100     100
## 48310 4.099114e-286    100     100

We can then visualize the expression profiles of individual genes, such as the differentially expressed genes that we just identified. This allows us to visually confirm the cluster- or dataset-specific expression patterns of marker genes. plotGene returns graphs of gene loading on the dimensionally reduced graph for each dataset.

PRF1 <- plotGene(ifnb_liger, "PRF1", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T)
PRF1[[1]] + PRF1[[2]]

We can also use plotGene to compare the loading of cluster markers within and between datasets.

IFIT3 <- plotGene(ifnb_liger, "IFIT3", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
IFITM3 <- plotGene(ifnb_liger, "IFITM3", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
plot_grid(IFIT3[[1]],IFIT3[[2]],IFITM3[[1]],IFITM3[[2]], ncol=2)