WASP: a versatile, web-accessible single cell RNA-Seq processing platform


	Implementation

WASP is separated into two modules for primary data processing and result visualization, which are implemented in Snakemake and R Shiny. Snakemake is a workflow management system which utilizes a Python-based description language. Thanks to the usage of Conda ( https://docs.conda.io/en/latest/) environments, Snakemake ensures reproducibility and facilitates scalable execution of analyses adaptable to the used hardware. Shiny is an R package that allows users to visualize results and dynamically re-calculate them after user interaction. Results and graphics are displayed as interactive web pages which can be further modified using HTML, CSS and JavaScript. These features significantly lower the required level of expertise necessary to analyze data sets with WASP.


	Reproducible pre-processing of raw data using Snakemake

scRNA-seq data obtained using the BioRad ddSEQ or 10x Genomics protocol is first imported into the pre-processing application. Using a corresponding reference genome and annotation data provided in GTF format, the initial steps of the WASP workflow take care of reference mapping, feature extraction and quantification. This module is implemented as a Snakemake workflow in order to enable reproducibility of results, and users are provided with JSON-based files documenting the quality metrics as well as results of the analysis. The Snakemake workflow utilizes a variety of established tools for the processing including FastQC ( https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to verify the quality of the reads, STAR [16] to map reads to the reference genome, featureCounts [17] to extract reads mapping to exons and UMI-tools [18] to remove duplicate reads originating from the same mRNA fragment (Fig.1a).



	Inspection of pre-processing results and estimation of valid barcodes

For a user-friendly assessment of the results, a Shiny web application is provided presenting information about each step, e.g. number of extracted barcodes, number of mapped reads/extracted features per barcode or number of UMIs per barcode. Besides the visualization, the Shiny app also addresses the selection of correct barcodes. This is of vital importance because a major challenge in Drop-Seq-based protocols is the verification of the number of barcodes belonging to “real cells”. This problem arises due to the incorporation of free RNA released from broken cells inside the droplets during the cell separation. As a consequence, some droplets contain these free RNAs instead of mRNAs from one whole cell. Free RNAs are processed similarly to mRNAs from a whole cell and appear in the data with a barcode and UMI. However, these reads should be filtered out as they mark false-positive hits and would increase noise for the post-processing and also slow down further analyses. WASP calculates a cutoff containing all barcodes expected to belong to intact cells. For this, a ‘knee plot’ is calculated, by ordering all detected barcodes descending on the x-axis according to the number of UMIs. The y-axis shows the logarithmic number of UMIs identified in each barcode. To determine the cutoff, the first turning point of the plotted curve must be calculated. This is done by ranking the barcodes in descending order by their UMI counts. Subsequently, a score is calculated taking the difference in logarithmic UMI size between each two neighbouring barcodes divided by the difference in their rank to take into account when multiple barcodes share the same rank. Additionally, users can select a minimum of expected cells in the uploaded data to aid the cutoff detection. Finally, the barcode showing the largest difference in this metric is selected as cutoff. However, users can increase or decrease the number of barcodes to be used manually for the following analyses. Ultimately, based on the selected cutoff a gene expression matrix is generated, in which rows represent the genes and columns correspond to the selected cells (barcodes). The entries contain expression levels according to the number of unique UMIs per gene per cell. To enhance reproducibility, users may also download a table containing all parameters of the tools used during the Snakemake analysis as well as the selected number of barcodes. The gene expression matrix can then be transferred to the post-processing application for downstream analysis. Of course, the post-processing can also handle similarly formatted gene expression matrices generated by other pre-processing pipelines.


	User-friendly visualization of results using R shiny

The post-processing of the filtered scRNA-seq results is based on an additional R Shiny-based application, which requires input of a gene expression matrix obtained from the pre-processing step (for BioRad ddSEQ and 10x Genomics protocol-based experiments) or obtained from other sources (Fig.2). After uploading the corresponding data analysis files, the user is offered access to two different types of analysis: an automatic mode or manual analysis. The automatic mode is designed for less experienced users, calculating all analyses in one run with default parameters and storing the results from each step for later visualization. After the analysis, the user can browse the results in interactive web pages. More experienced users can select the manual mode which calculates the analyses stepwise, giving users the opportunity to change parameters between steps. Finally, all results are presented as an interactive web page, similar to the automatic analysis.

Before starting the analysis various filter options can be selected, such as thresholds for the amount of UMIs and genes per cell or the number of transcripts per gene, thereby allowing removal of cells with low quality. WASP provides different means in order to normalize the UMI counts between cells, cluster cells into subpopulations, and also facilitates the identification of differentially expressed genes based on the Seurat R package. This includes selecting highly variable genes to reduce noise in the further analysis and dimensionality reduction based on principal component analysis (PCA). Convenient visualizations such as t-SNE, UMAP and multidimensional scaling (MDS) enable assessment of details such as cluster assignment or expression of specific genes in all cells/clusters. All obtained results are provided as publication-ready high-quality charts in PDF format or as CSV-based data exports for additional processing with external applications such as MS Excel. Additionally, a CSV-based file is generated documenting all used parameters to support reproducible analyses (Fig.1b).


	Separate processing provides flexibility

The separation of WASP into two modules offers several benefits. One key advantage is that this enables optimal adaptation of the software to the different hardware and software requirements involved in pre-processing and post-processing. Pre-processing includes a mapping step as one of the main processing tasks. Because scRNA-seq data is normally based on eukaryotic organisms, often human or mouse cells, mapping usually requires a higher amount of memory (RAM). Another critical and computationally intensive step is the demultiplexing as every mapped read has to be checked for a valid barcode and UMI sequence. We provide tailored pre-processing routines to extract ddSeq- and 10x-specific barcode and UMI sequences. In order to speed up the demultiplexing, we implemented this step as a parallel process while maintaining reproducibility using Snakemake (Fig.2). Another aspect is the usage of well-established tools to yield the best possible results from the input data as some of these tools are designed to work on Linux-based systems only. Therefore, the pre-processing significantly benefits from a Linux-based environment with multiple cores and a high amount of RAM such as a compute cluster or workstation.
In contrast to this, post-processing is implemented with R-based scripts and packages, allowing it to be used on other operating systems as well. Also, post-processing generally does not require such a high amount of RAM like the pre-processing and does not benefit as much from a high number of CPU cores, as its calculation is typically performed within minutes on a standard laptop. Furthermore, the post-processing starts with a gene expression matrix making it independent from the protocol used for the scRNA-seq data generation (Fig.2). Since scRNA-seq is still a relatively new technology, there are some specific problems during post-processing in some data sets. Experienced users, however, might be able to solve these issues and retrieve more information by changing specific parameters.
To fulfill these requirements and to offer users the highest possible flexibility, we have separated the pre-processing and post-processing into two parts. This allows users to easily process ddSeq- and 10x-based reads while benefitting from established tools during the pre-processing. In addition, users are able to use the post-processing on normal computers or even laptops based on Linux or other platforms such as Windows or MacOS. The separation further expands the application of WASP to include scRNA-seq data from protocols such as 10x Genomics. Also, the post-processing enables experienced users to modify parameters in order to extract as much information as possible from challenging data sets.
To allow easy use, both pre-processing and post-processing are available via Docker, eliminating the need to install further dependencies. Moreover, the pre-processing can easily be installed using Conda and the post-processing is available as a standalone version for Windows, requiring only an installed web browser.
