Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology


	2 Results



	2.1 Benchmark on simulated and true bulk RNA-seq samples reveals differences in method performance between cell types

We created 100 simulated bulk RNA-seq samples with known cell type proportions using the single cell dataset. For each sample, we individually retrieved and aggregated 500 random immune- and non-immune cells. This approach has been established and successfully applied bySchelker et al. (2017) for benchmarking CIBERSORT. Additional consistency checks support that simulated bulk RNA-seq data are not subject to systematic biases (Supplementary Figs S1–S4).
We applied the seven methods to these samples and compared the estimated to the known fractions. The results are shown inFigure 1a. All methods obtained a high correlation on B cells (Pearson’s r  >  0.71), cancer associated fibroblasts (r  > 0.72), endothelial cells (r  >  0.94) and CD8+ T cells (r >  0.76). Most methods obtain a correlation of r >  0.68 on macrophages/monocytes, NK cells and overall CD4+ T cells. We observed poor performance in distinguishing regulatory from non-regulatory CD4+ T cells and in estimating dendritic cells (DCs).

We validated the results using three independent datasets, which provide both bulk RNA-seq data and a gold-standard estimate of immune cell fractions using FACS (Fig. 1b–d andSupplementary Fig.S5). In general, the results are highly consistent with the mixing benchmark, with the exception that the methods’ performance on CD8+ T cells is worse on the validation data considered altogether (Fig. 1c). This can be attributed to the fact that the variance of CD8+ T cell contents between the samples is low and the correlations are therefore not meaningful (seeFig. 1b andSupplementary Fig.S5). Moreover, in the simulation benchmark, only TIMER detects DCs, while in Hoek’s dataset (Hoek et al., 2015), all methods except CIBERSORT detect a signal. This inconsistency and the poor performance on DCs in general might be indicative of the still not well defined biological heterogeneity between DC subtypes.


	2.2 Background predictions widespread among deconvolution-based methods

Next we investigated two related questions: at which abundance level do methods reliably identify the presence of immune cells (minimal detection fraction) and, conversely, what fractions of a certain cell type are predicted even when they are actually absent (background predictions). To this end, we took advantage of the fact that single cell-based simulations allow us to create artificial bulk RNA-seq samples of arbitrary compositions. For each cell type, we created samples by spiking-in an increasing amount of the cell type of interest into a background of 1000 cells randomly sampled from all other cell types. We measured the background prediction level as the predicted score on the background only, i.e. no spike-in cells added. We defined the minimal detection fraction as the minimal number of spike-in cells needed for the score to be significantly different from the background.
We observed that, in most cases, the deconvolution-based approaches predict a minimal amount of immune cells even though they are absent (Fig. 2). Yet, background predictions are low for CAFs and NK cells for EPIC and non-regulatory CD4+ T cells and NK cells for quanTIseq. Also, TIMER does not suffer from background predictions (except for DCs) at the cost of a highly elevated minimal detection fraction. In contrast, xCell, which uses a statistical test for enrichment of the marker-genes, is highly robust against background-predictions (score = 0 for all tested cell types) at a slightly elevated minimal detection fraction (<5% infiltration for most cell types). Unfortunately, testing MCP-counter for its detection limit and background-predictions is not straightforward, as the method does not compute a score, but uses raw gene expression values. To make a statement about the absence of a cell type, a platform-specific null-model needs to be generated, which, in fact, is already provided by the authors for some microarray platforms, but not for RNA-seq. The authors also addressed the detection limit on RNA mixtures in their original publication (Becht et al., 2016). In short, we observe that deconvolution-based approaches are susceptible to background-predictions that might be due to the similarity of the signatures of closely related cell types (multicollinearity) and/or to a lower cell-specificity of signature/marker genes.



	2.3 Spillover can be attributed to non-specific signature genes

Motivated by the hypothesis that background predictions might be due to non-specific signature genes, we asked which cell types lead to methods erroneously predicting a higher abundance of another. We refer to this effect as ‘spillover’. We assessed spillover effects using simulated bulk RNA-seq samples containing single immune cell types (Fig. 3) and validated the results using true bulk RNA-seq samples of FACS-purified immune cells (Supplementary Fig.S6).

In Section 2.2, we observed, for instance, that quanTIseq is affected by a high background prediction level for macrophages/monocytes (Fig. 2). In the spillover analysis inFigure 3, we note that quanTIseq predicts a pure CAF sample to contain a considerable amount of macrophages/monocytes. We, therefore, suspect that the high background prediction level is driven by non-specific marker genes in the quanTIseq signature matrix. Indeed, we identified five genes, CXCL2, ICAM1, PLTP, SERPING1 and CXCL3 that are expressed in both CAFs and Macrophages/Monocytes. After removing these genes from the matrix, the background prediction level is significantly reduced by 27% (Fig. 4a).

Beyond, for all methods, we consistently observe spillover between CD8+ and CD4+ T cells, between NK cells and CD8+ T cells and from DCs to B cells. The former two spillover effects are conserved in the validation dataset (Supplementary Fig.S6) supporting that the effects are not due to misannotated cells in the single cell dataset.
The spillover between DCs and B cells could not be confirmed in the validation dataset, however, we demonstrate that in the single cell dataset, the B cell and DC clusters are both distinct and well-annotated (Supplementary Fig.S7). Proceeding analogously to the CAF/macrophage spillover above, we identified six genes, TCL1A, TCF4, CD37, SPIB, IRF8 and BCL11A, that are expressed both in the B cell and DC subpopulations on the mRNA level. Indeed, in the LifeMap Discovery database (Edgar et al., 2013), all six genes are annotated as being expressed in both B cells and plasmacytoid DCs (pDCs). When excluding these genes from the deconvolution matrices, the spillover effect is substantially reduced by 55% for EPIC and 78% for quanTIseq (Fig. 4b). Given that the single cell dataset contains pDCs, quanTIseq has been trained on myeloid DCs (mDCs) and EPIC does not consider any kind of DCs (Supplementary Table S2), it is not surprising that both methods view these genes as B cell markers and are not able to discriminate between pDCs and B cells.


	2.4 Guidelines for method selection

InTable 2, we provide guidelines on which method to use for which cell type based on three criteria: interpretability of the score, overall performance and possible limitations. It is very important to understand the implications of the different scoring strategies. EPIC and quanTIseq are the only methods providing an ‘absolute score’ that represents a cell fraction. All other methods provide scores in arbitrary units that are only meaningful in relation to another sample of the same dataset. For this reason, and due to a robust overall performance, we recommend EPIC and quanTIseq for general purpose deconvolution. In practice, absolute scores are not always necessary. For instance, in a clinical trial, relative scoring methods can be used to infer fold changes between treatment and control group or to monitor changes in the immune composition in longitudinal samples. In that case, using MCP-counter is a good choice thanks to its highly specific signatures that excelled in the spillover benchmark. A limitation of deconvolution methods is that they are susceptible to background predictions, i.e. prediction of (small) fractions for cell types that are actually absent. Therefore, when interested in presence/absence of a cell type, we suggest using xCell.


	2.5 Immunodeconv R package provides easy access to deconvolution methods

We created an R package, immunedeconv, that provides a unified interface to the different deconvolution methods. The package is freely available from GitHub: https://github.com/grst/immunedeconv. In a separate repository ( https://github.com/grst/immune_deconvolution_benchmark), we provide a snakemake pipeline (Koster and Rahmann, 2012) that allows to fully reproduce this benchmark and includes step-by-step instructions on how to test additional methods.


	5 Materials and Methods



	5.1 Single-cell data

The dataset of 11 759 single cells fromSchelker et al. (2017) was obtained through their figshare repository ( https://figshare.com/s/711d3fb2bd3288c8483a). We reproduced their analysis using their MATLAB source code and exported the final dataset to continue our analysis in R. We excluded all cells that were classified as ‘Unknown’ from the downstream analysis. Moreover, we did not consider cells originating from Peripheral blood mononuclear cells (PBMC) as we were interested in the methods’ performance on cancer samples. Gene expression values are expressed as transcripts per million (TPM).


	5.2 Immune-cell reference data

Immune cell reference profiles were obtained from the sources described inSupplementary Table S3. FASTQ files were extracted from SRA files using the fastq-dump command of the SRA toolkit (Leinonen et al., 2011). Reads were aligned and gene expression estimated as TPM using STAR (Dobin et al., 2013) and RSEM (Li and Dewey, 2011) using the rsem-calculate-expression command in paired-end mode.


	5.3 Validation data

We obtained preprocessed bulk RNA-seq and FACS data for eight PBMC samples fromHoek et al. (2015) through personal communication from the quanTIseq authors (Finotello et al., 2017). We obtained bulk RNA-seq and FACS data for four metastatic melanoma samples fromRacle et al. (2017) through GEO accession number GSE93722. We obtained bulk RNA-seq and FACS data for three ovarian cancer ascites samples (Schelker et al., 2017) from the figshare repository ( https://figshare.com/s/711d3fb2bd3288c8483a). The three ovarian cancer ascites samples consist of two technical replicates each. The two replicates are highly consistent for each of the samples (Pearson’s r≥0.98). We therefore, took the straightforward approach of merging them by the arithmetic mean for each gene. All gene expression values are expressed as TPM.


	5.4 Cell type mapping

For comparing the methods, it is essential to map the cell types of the different methods to a controlled vocabulary (CV). It has to be taken into account that the different methods resolve the cell types at different granularities. For example, while CIBERSORT predicts naive- and memory CD8+ T cells, all other methods only predict CD8+ T cells. We address this issue by creating a hierarchy of cell types, and map each cell type from a dataset or method to a node in the cell type tree (seeSupplementary Fig.S8 andSupplementary Table S4). If a node (e.g. CD8+ T cell) is to be compared among the different methods, the estimated fraction is computed as follows: (i) if the method provides an estimate for the node, take that value. (ii) If the method provides an estimate for all child nodes, sum up all children. This is done recursively until no estimates are available or the leaves of the tree are reached. (iii) If an estimate is missing for at least one of the child nodes, no estimate will be available. An exception was made for the following cell types that are only provided by few methods: T cell gamma delta, Macrophage M0, Monocyte, T cell follicular helper. If an estimate is missing for one of those cell types, the remaining child nodes will be summed up.


	5.5 Validation of the methodology

Schelker et al. (2017) provide three ovarian cancer ascites samples for which both single cell RNA-seq and bulk RNA-seq data are available. We generated artificial bulk samples by taking the mean of the TPM values for each gene of all single-cell samples. First, we compared the values using Pearson correlation on log-scale. Next, we used BioQC (Zhang et al., 2017) to test for differential enrichment of gene ontology (GO)-terms (The Gene Ontology Consortium, 2017). Finally, we ran the deconvolution methods on both the true and simulated bulk RNA-seq samples and compared the results. To demonstrate that xCell’s low correlation is in fact due to little variance between the samples, we reran xCell while additionally including the 51 immune cell reference samples (Supplementary Table S3) in the expression matrix of both simulated and true bulk RNA-seq.


	5.6 Deconvolution methods

We implemented an R package, immunedeconv, that provides a unified interface to all seven deconvolution methods compared in this paper. The CIBERSORT R source code was obtained from their website on 2018-03-26. The xCell, EPIC, MCP-counter, TIMER and quanTIseq source codes were obtained from GitHub from the following commits: dviraran/xCell@870ddc39, GfellerLab/EPIC@e5ae8803, ebecht/MCPcounter@e7c379b4, hanfeisun/TIMER@a030bac73, FFinotello/quanTIseq@ee9f4036.
We ran CIBERSORT with disabled quantile normalization, as recommended on their website for RNA-seq data. While quanTIseq provides an entire pipeline, starting with read-mapping and estimation of gene expression, we only ran the last part of that pipeline, which estimates the immune cell fractions from gene expression data. We ran TIMER with ‘OV’ on ovarian cancer ascites samples and with ‘SKCM’ on melanoma samples. We ran quanTIseq with the option tumor = TRUE on all tumor samples and tumor = FALSE on the PBMC samples. We ran EPIC with the TRef signature set on all tumor samples and with BRef on the PBMC samples. We ran xCell with the cell.types.use parameter to avoid overcompensation by the spillover correction. For simulated tumor data, cell.types.use was set to B, CAF, DC, Endo, Mac/Mono, NK, T CD4+ n.r., T CD8+, T reg. For the validation datasets, it was set to B, DC, Mono, NK, T CD4+, T CD8+, T. We disabled the mRNA scaling options of quanTIseq and EPIC for the single cell simulation benchmark using the mRNAscale and mRNA_cell options, respectively. Notably, this only has an effect on the absolute values, but not on the correlations used to compare all methods. For each of the datasets (simulated, Hoek, Schelker, Racle), we submitted all samples in a single run.


	5.7 Simulation benchmark

We simulated bulk RNA-seq samples by calculating the average gene expression per gene of cells sampled from the single cell dataset. Independently for each sample, 500 cells were randomly selected as follows: (i) a fraction f of cancer cells was drawn from N(0.33,0.3), which is the empirical distribution of cancer cells in the melanoma and ascites samples in the single cell dataset. f was constrained to the interval [0,0.99]. (ii) Half of the samples use melanoma cancer cells, the other half ovarian cancer cells. (iii) The remaining fraction 1−f was randomly assigned to immune cells, resulting in a fraction vector. (iv) The fraction vector was multiplied with 500 to obtain a cell count vector. (v) The corresponding number of cells was drawn with replacement from the single cell dataset. That implies that for cell populations with only a few cells available, the artificial bulk sample will contain the same single cell sample multiple times.
Known fractions of simulated samples are compared to the predicted scores using Pearson’s correlation coefficient r. P-values and r have been computed using the cor function in R.


	5.8 Minimal detection fraction and background prediction levels

We simulated bulk RNA-seq samples as follows: For each cell type c, we create five independent samples for each i∈{0,5,10,…,50,60,…,100,120,…,300,350,…,500}. For each sample independently, we randomly sampled i cells of type c and a background of 1000 cells containing all cell types except c from the single cell dataset. The selected cells were aggregated by calculating the arithmetic mean for each gene. This results in five batches of 300 samples each (30 spike-in levels × 10 cell types). Each of the five batches was submitted to the methods independently in a single run.
We quantify the background prediction level of cell type c as the predicted score of c in samples of type c with i = 0, i.e. in samples with no cells of type c present. We quantify the minimal detection fraction as the minimal i at which the predicted score of c is significantly different from the background prediction level (one-sided t-test, alpha = 0.05).


	5.9 Spillover

We refer to spillover as the erroneous prediction of another cell type due to a partial overlap of the signatures. We measured spillover on two datasets: (1) 44 true bulk RNA-seq samples of pure immune cells (Supplementary Table S3); (2) 45 simulated bulk RNA-seq samples of nine cell types. For each cell type c, we simulated five samples by independently aggregating 500 random cells of type c by taking the arithmetic mean. Each dataset was submitted to the methods independently in a single run. We visualized the results as chord diagrams (Fig. 3) using the R package circlize (Gu et al., 2014).


	5.10 Other tools

We generated reproducible reports using bookdown 0.7. We use Snakemake 5.2.0 (Koster and Rahmann, 2012), conda and bioconda (Grüning et al., 2018) to integrate our analyses in a reproducible pipeline.


	5.11 Code availability

An R package providing a unified interface to the deconvolution methods is freely available from GitHub: https://github.com/grst/immunedeconv. A ready-to-run Snakemake pipeline, including preprocessed data, to reproduce the results is available from https://github.com/grst/immune_deconvolution_benchmark.


	Supplementary Material

