This vignette shows how to jointly define clusters using single-cell RNA-seq and single-nuclear ATAC-seq data. We will use the peripheral blood mononuclear cell (PBMC) scRNA-seq and snATAC-seq datasets provided by 10X. The workflow for jointly analyzing RNA and ATAC is quite similar to that for integrating multiple RNA datasets. The only differences are (1) we need to process the ATAC data into gene-level values comparable to gene expression and (2) we perform gene selection using the RNA data only.
Because ATAC-seq measures chromatin accessibility, a different data modality than gene expression, we need to process the ATAC data to obtain gene-level features that we can use to integrate with the RNA data. Although one could imagine many strategies for calculating gene-level features from ATAC data, we found that the simplest possible strategy works well: counting the number of reads that overlap the gene body and promoter for each gene. We show how to do this below starting from the cellranger
output, but for now we will simply read in the final gene x cell matrices for RNA and ATAC data. These matrices and all other inputs are provided [here] (https://umich.box.com/s/5hkhpou4inrulo570yhc38ay8g3ont5b), so we can load them directly:
rna_clusts = readRDS("rna_cluster_assignments.RDS")
atac_clusts = readRDS("atac_cluster_assignments.RDS")
pbmc.atac <- readRDS('pbmc.atac.expression.mat.RDS')
pbmc.rna <- readRDS('pbmc.rna.expression.mat.RDS')
Here we show how to compute a gene x cell feature matrix starting from the fragments.tsv
file output by the 10X cellranger count
pipeline. SKip this section for now if you want to simply try out liger
on the provided counts.
Although one could imagine many strategies for calculating gene-level features from ATAC data, we found that the simplest possible strategy works well: counting the number of reads that overlap the gene body and promoter for each gene. We found that summing the peak counts output by cellranger count
for the peaks overlapping each gene can also work, but this strategy is less desirable because (1) information from reads not in peaks is lost and (2) the cellranger
peak calling is performed on all cells, which leads to an overrepresentation of peaks from abundant cell populations and biases against rare cell populations.
We sort the fragments output by cellranger
by barcode and use the BEDOPS bedmap
tool to make a list of cell barcodes that overlap each gene and promoter. To perform this step, you need to install BEDOPS (see this page for installation instructions). The file atac_fragments.tsv
is available from the 10X website at [this link] (http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz). Note that this file is very large and the step to sort the file is somewhat slow.
system("sort -k4,4 atac_v1_pbmc_10k_fragments.tsv > atac_fragments.sort.bed")
system("bedmap --delim \"\\t\" --echo --echo-map-id hg19_genes.bed atac_fragments.sort.bed > atac_genes_bc.bed")
system("bedmap --delim \"\\t\" --echo --echo-map-id hg19_promoters.bed atac_fragments.sort.bed > atac_promoters_bc.bed")
We then import the bedmap output into R. Note that we use the as.is
parameter in read.table
to supress the conversion of character columns like gene symbol to factors. Using Liger’s makeFeatureMatrix
function, we calculate gene and promoter accessibility counts. Finally, we sort the rows of these matrices by gene symbol and add them together. Note that we could also calculate these combined counts in a single step by combining the gene and promoter BED files; we choose to calculate them separately in case this is useful for downstream analyses.
library(liger)
genes.bc <- read.table(file = "atac_genes_bc.bed", sep = "\t", as.is = c(4,7), header = FALSE)
promoters.bc <- read.table(file = "atac_promoters_bc.bed", sep = "\t", as.is = c(4,7), header = FALSE)
#The makeFeatureMatrix function requires a list of barcodes as a vector
barcodes = names(atac_clusts)
gene.counts <- makeFeatureMatrix(genes.bc, barcodes)
promoter.counts <- makeFeatureMatrix(promoters.bc, barcodes)
gene.counts <- gene.counts[order(rownames(gene.counts)),]
promoter.counts <- promoter.counts[order(rownames(promoter.counts)),]
pbmc.atac <- gene.counts + promoter.counts
Liger’s read10X
function can be used to read cellranger count
output files into R.
sample.dir = '.'
sample.name = 'pbmc.rna'
pbmc.rna <- read10X(sample.dirs = list(sample.dir), sample.names = list(sample.name))
The algorithm takes a list of two or more count matrices as input. Genes should be in rows and cells in columns. We can make a list of the count matrices and pass this as parameter to the createLiger
function to construct a Liger object.
library(liger)
ggplot2::theme_set(theme_cowplot())
pbmc.data = list(atac=pbmc.atac[,names(atac_clusts)], rna=pbmc.rna[,names(rna_clusts)])
int.pbmc <- createLiger(pbmc.data)
Before computing the factorization, we need to normalize the data to account for different capture efficiency and sequencing depth per cell with normalize
, select variable genes with selectGenes
, and scale the data with scaleNotCenter
. Note that we do not mean-center the data because nonnegative matrix factorization requires nonnegative values.
The selectGenes
function performs variable gene selection on each of the datasets separately, then takes the union of the result. The variable genes are selected by comparing the variance of each gene’s expression to its mean expression. The selectGenes
function was written primarily scRNA-seq in mind, and snATAC-seq data distribution is quite different. So instead of taking the union of variable genes from RNA and ATAC, we set datasets.use = 2
in the function to perform gene selection using only the RNA dataset.
int.pbmc <- normalize(int.pbmc)
int.pbmc <- selectGenes(int.pbmc, datasets.use = 2)
int.pbmc <- scaleNotCenter(int.pbmc)
Next we perform integrative non-negative matrix factorization in order to identify shared and distinct metagenes across the datasets and the corresponding metagene loadings for each cell. The most important parameters in the factorization are k
(the number of factors) and lambda
(the penalty parameter which limits the dataset-specific component of the factorization). The default value of lambda=5.0
usually provides reasonable results for most analyses. For this analysis, we simply use k = 20
and the default value of lambda
.
int.pbmc <- optimizeALS(int.pbmc, k=20)
After the factorization, we still need to quantile normalize the factor loadings across the datasets. Notice that if we plot a t-SNE representation of the factor loadings before normalization, the cells still cluster mainly by modality.
int.pbmc <- runTSNE(int.pbmc, use.raw = T)
p1 <- plotByDatasetAndCluster(int.pbmc, return.plots = T)
print(p1[[1]])
To better integrate the datasets, we perform a quantile normalization step. This process first identifies similarly loading cells across datasets by building a similarity graph based on shared factor neighborhoods. Using Louvain community detection, we then jointly identify clusters across datasets, and normalize quantiles within each cluster and factor. The key parameters in this step are resolution
(increasing this increases the number of communities detected) and knn_k
(the number of dataset neighbors used in generating the shared factor neighborhood). In general, lowering knn_k
will allow for more fine-grained identification of smaller groups, but possibly at the cost of decreased alignment across datasets. Here we simply use default settings of resolution=1.0
and knn_k=20
.
int.pbmc <- quantileAlignSNF(int.pbmc)
Now we can visualize the integrated data.
int.pbmc <- runTSNE(int.pbmc)
plots <- plotByDatasetAndCluster(int.pbmc, return.plots = T,clusters=rna_clusts)
print(plots[[1]])
To see how the aligned space preserves the structure from the RNA data, we can plot the previously annotated cluster assignments for the RNA profiles:
print(plots[[2]])
Plotting the expression and accessibility for NK and B-cell markers indicates that these cells are properly aligned.
NKG7 = plotGene(int.pbmc,gene="NKG7",return.plots=T)
MS4A1 = plotGene(int.pbmc,gene="MS4A1",return.plots=T)
plot_grid(NKG7[[2]],MS4A1[[2]],NKG7[[1]],MS4A1[[1]],ncol=2)