Open Access additional raw data Single-Cell RNA-Seq Reveals Transcriptional Heterogeneity in Latent and Reactivated HIV-infected Cells Monica Golumbeanu1,2, Sara Cristinelli3, Sylvie Rato3, Miguel Munoz3, Matthias Cavassini4, Niko Beerenwinkel1,2, *, Angela Ciuffi3, * Doi : 10.1016/j.celrep.2018.03.102 EXPERIMENTAL PROCEDURES The accession number for the RNA-seq data reported in the paper and processed as described below is GEO: GSE111727. Computational analysis of bulk and single-cell RNA-Seq data A standard pipeline for preprocessing RNA-Seq data was used to filter and align the sequencing reads from both single-cell and bulk RNA-Seq experiments (as detailed in the next sections below). Based on the Software tools scran (Lun et al., 2016) and scater (McCarthy et al., 2017) applied to the gene expression read counts for each single-cell data set, we designed a customized filtering approach to discard low-quality cells in the scRNA-Seq datasets. The filtering criteria were the total number of reads associated to genes, the proportion of mitochondrial reads, and the proportion of ERCC spike-ins reads (Supplemental Figure S1C). For the data produced with the experimental latency model, single-cell read counts were normalized using linear size factors calculated from the ERCC spike-ins with the package scran (Lun et al., 2016). In absence of ERCC spike-ins, single-cell read counts for the scRNA-Seq data from HIV+ individuals were normalized using a normalization procedure specifically designed for single-cell data without spike-ins measurements, based on computing cell-specific size factors by considering random pools of cells (L. Lun et al., 2016). Normalized read counts were log2-transformed. Principal component analysis (PCA) (Pearson, 1901) was performed on the normalized single-cell read counts from the latency model, followed by k-means clustering in order to identify sub-populations of cells. We validated the two identified cellular subpopulations through a comprehensive study employing other normalization and dimension reduction methods (as detailed in the next sections below). For the cells isolated from HIV+ donors, PCA was performed on the normalized read counts and was followed by model-based clustering of the cells from each donor and treatment condition by using the R package mclust (Scrucca et al., 2016). One outlier cell was removed from the analysis following outlier detection. Differential expression analysis was performed with the software MAST (Finak et al., 2015). Only genes expressed in at least one cell (expression level > 0) per cluster were considered. Following the statistical test implemented in MAST, genes with a Benjamini-Hochberg corrected p-value less than 0.01 were considered differentially expressed. Enrichment analysis was conducted on the identified DE genes using the pathways defined in the Reactome database (Fabregat et al., 2016). For this purpose, a hypergeometric test was performed using a background list consisting of genes expressed at least in one cell and condition and the resulting p-values were corrected with the Benjamini-Hochberg method. Preprocessing analysis pipeline for bulk and scRNA-Seq data Software tools Cutadapt v1.7.1 [1] and Prinseq-lite [2] were used for adapter trimming and read filtering, respectively. Alignment of the sequencing reads was performed against the Ensembl GRCh38 assembly of the human genome concatenated with the HIV viral sequence (Zenodo_Data_S1) using the STAR v2.5 aligner [3]. The following alignment parameters were different from default: --outFilterMismatchNoverLmax 0.1 --seedSearchStartLmax 50. Afterwards, gene expression levels were quantified by counting the number of reads within each gene with HTSeq v0.5.3 [4]. Data processing results for the latency model On average, bulk RNA-Seq produced 48 million sequencing reads per sample, with a 95.6% alignment rate, while scRNA-Seq yielded 3 million sequencing reads per cell, with 93.5% aligned reads (Figure S1A and S1B, Table S1, Zenodo_Data_S2). After filtering of low quality cells, there were 224 remaining cells out of the initial 288 (Figure S1C, Table S1). The correlation between pooled single-cell samples and corresponding bulk RNA-Seq data was also assessed. An average correlation of 0.66 per condition indicated a significantly high agreement between bulk and scRNA-Seq data (Spearman rank correlation test: p<10-15, Figure S1D). Comprehensive analysis of cellular heterogeneity in scRNA-Seq data We explored the cellular heterogeneity in our data with different normalization and dimensionality reduction analysis methods, in order to rule out artifacts due to a specific approach or model assumption. First, we examined whether the observed cellular clusters per condition were a consequence of using a normalization approach based on spike-ins. Toward this end, we applied PCA to gene expression profiles that have been normalized with three other methods, namely Pooling, Relative log expression (RLE) and Trimmed median of M-values (TMM). The first method, Pooling, implemented in the R/Bioconductor package scran [5], is based on computing cell-specific size factors by considering random pools of cells and solving a corresponding system of linear equations [6]. The two other methods, RLE and TMM, have been developed for bulk RNA-Seq data. RLE, implemented in the DESeq package, uses size factors equivalent to the median of gene expression values adjusted by the geometric mean of their expression across cells [7]. TMM, implemented in the package edgeR, calculates the log2 fold changes, named M-values, between a “reference” cell and the other cells, removes 30% of their distribution confines and uses their weighted average to compute size factors [8]. Following the different normalization approaches, PCA and clustering led to the same heterogeneity patterns in the data, namely two clusters per condition (Figure S2A). An average rand index of 0.94 indicated that the resulting clustering solutions in each case were equivalent (Figure S2B). Due to the small amount of RNA material handled during the experiments, scRNA-Seq data contains a large number of genes with zero expression, called dropout events, which have been shown to influence PCA results [9]. The data analyzed in this work presents an effect of increased transcriptional expression in SAHA- and TCR-treated cells as opposed to untreated cells where the dropout rates are much higher (Figure S3A). In order to evaluate the effect of cellular expression depth on the PCA and subsequent clustering results, we also used several alternative dimensionality reduction methods, namely ZIFA [9], CIDR [10] and CMF [11], designed to correct for dropout events. ZIFA is based on a zero-inflated factor model which represents the dropout rate for a gene based on its expected expression level, CIDR uses a principal coordinate analysis with imputation approach, and CMF models single-cell data with a zero-inflated Gamma-Poisson factor model which accounts for dropout events. All the applied dimensionality reduction methods led to configurations equivalent to the cellular subpopulations previously identified with PCA (Figure S2C), with and average agreement of 93.49% between the clustering solutions (Figure S2D). Data processing results for the cells isolated from HIV+ individuals On average, for samples from HIV+ donors, scRNA-Seq yielded 4.8 million sequencing reads per cell, with 49.4% uniquely aligned reads (Figure S1E, Table S1, Zenodo_Data_S3). After filtering of low quality cells, there were 77 remaining cells out of the initial 96 (Figure S1F, Table S1). The scRNA-Seq datasets for the HIV+ donors present a high proportion of unmapped reads, with an average of 38% unmapped reads per cell. This effect is often observed in single-cell RNA-Seq data and may be due to several circumstances such as poor capture of mRNA, amplification errors, or cell damage [12]. Furthermore, following adapter trimming, reads can become too short to be accurately mapped to the reference genome, which also results in an increase in unmapped reads. We used stringent quality control thresholds for read alignment, and quantified expression using only high quality mapped reads. After alignment, despite the high amount of resulting unmapped reads, we obtained an average of 3.6 million reads per cell, which is considered to be sufficient for quantifying gene expression. Studies have shown that with 1 million reads, the majority of expressed genes can be reliably quantified [13-16]. Furthermore, despite the significant proportion of unmapped reads, by assuming uniform sampling of reads, we expect the qualitative expression features within cells to be conserved, i.e., low versus high expression of genes in groups of cells. We conclude that our qualitative validation of the existence of two clusters of cells in the data from HIV+ individuals is a reliable result not confounded by the high proportions of unmapped reads. SUPPLEMENTAL REFERENCES 1. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnetjournal. 2011;17. 2. Schmieder, R. and R. Edwards, Quality control and preprocessing of metagenomic datasets. Bioinformatics, 2011. 27. 3. Dobin, A., et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013. 29(1): p. 15-21. 4. Anders, S., P.T. Pyl, and W. Huber, HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 2015. 31(2): p. 166-169. 5. Lun, A.T.L., D.J. McCarthy, and J.C. Marioni, A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research, 2016. 5: p. 2122. 6. L. Lun, A.T., K. Bach, and J.C. Marioni, Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biology, 2016. 17(1): p. 1-14. 7. Anders, S. and W. Huber, Differential expression analysis for sequence count data. Genome Biology, 2010. 11(10): p. 1-12. 8. Robinson, M.D., D.J. McCarthy, and G.K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 2010. 26. 9. Pierson, E. and C. Yau, ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology, 2015. 16(1): p. 241. 10. Lin, P., M. Troup, and J.W.K. Ho, CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biology, 2017. 18(1): p. 59. 11. Durif, G., et al., Probabilistic Count Matrix Factorization for Single Cell Expression Data Analysis. ArXiv e-prints, 2017. 12. Rizzetto, S., et al., Impact of sequencing depth and read length on single cell RNA sequencing data of T cells. Scientific Reports, 2017. 7(1): p. 12781. 13. Wu, A.R., et al., Quantitative assessment of single-cell RNA-sequencing methods. Nature methods, 2014. 11(1): p. 41-46. 14. Pollen, A.A., et al., Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nature biotechnology, 2014. 32(10): p. 1053-1058. 15. Shalek, A.K., et al., Single cell RNA Seq reveals dynamic paracrine control of cellular variation. Nature, 2014. 510(7505): p. 363-369. 16. Bacher, R. and C. Kendziorski, Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biology, 2016. 17(1): p. 1-14. Attached files * Zenodo_Data_S1: Genomic sequence of the HIVGFP/VSV-G virus used to infect cells in the cellular latency experimental model. * Zenodo_Data_S2: Normalized expression values for the HIV latency model single-cell RNA-Seq data. The data is stored in an rdata library accessible with software R. * Zenodo_Data_S3: Filtered and normalized expression values for the HIV single-cell RNA-Seq data from HIV+ individuals. The data is stored in an rdata library accessible with software R.