RASflow: an RNA-Seq analysis workflow with Snakemake


	Implementation

Figure1 shows a schematic representation of the RASflow workflow. It starts with performing QC of the raw FASTQ files using FastQC ( https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). The QC report is presented to the user along with a question of whether the reads should be trimmed. When opted for, trimming is performed using the tool Trim Galore ( https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) and subsequently, an additional QC report is generated.

When the user is satisfied with the quality of the reads, the workflow proceeds to the next step: quantification of read abundance or expression level for transcripts or genes. The user decides whether to map the reads to a transcriptome or a genome depending on the goal of the analysis and availability of data. If the purpose of the analysis is to identify differentially expressed genes, it is suggested to map the reads to a transcriptome using pseudo alignment with Salmon [9]. A quantification table of the transcripts is generated from this step. Alternatively, the user can choose for the reads to be mapped to a genome. The aligner used in RASflow is HISAT2 [25] which has relatively modest memory requirements (∼4.3GB for the human genome) compared with for example the STAR aligner (requiring ∼27GB for the human genome) [26]. The alignment step is followed by a quality evaluation performed by Qualimap2 [27] and feature counting done by featureCounts [28] or htseq-count [29]. To be noted, after most of the steps, a summary report is generated using MultiQC [30].
When a quantification matrix for the genes/transcripts has been produced, RASflow can proceed to perform a DEA analysis using edgeR [31,32] or DESeq2 [33]. RASflow supports both single and paired statistical tests. The user specifies which statistical test mode to be applied in the configuration file based on their experimental design. If the reads were mapped to a transcriptome, DEA will be done on both transcript- and gene-level. In any case, the outputs of DEA include three types of tables: normalized quantification tables, some important statistics for the whole gene or transcript list, and the list of significantly differentially expressed genes or transcripts (with default threshold of FDR<0.05). The raw count is normalized based on Trimmed Mean of M values (TMM) [34] (if edgeR is used) or the median-of-ratios method [35] (if DESeq2 is used) when the reads are mapped to a genome. But if the reads are mapped to a transcriptome, the normalized values are estimated Transcripts Per Million (TPM) from Salmon scaled using the average transcript length over samples and then the library size by "tximport" [36]. The results of DEA is also visualized with a volcano plot enabling visual identification of genes with high fold change whose differential expression is also statistically significant, and a heatmap that not only visualizes the expression pattern of the identified differentially expressed genes, but also a clustering of the samples based on those genes, so that the user can get an idea of how well separated the groups are.
To ensure smooth installation and reproducibility of the workflow, all the tools included are fixed to a specific version which can be found in the environment configuration file (env.yaml).


	Discussion



	Virtual environment by Conda

The whole workflow is installed and run in a virtual environment created by Conda. While creating the virtual environment, all dependencies using the specified versions are installed at once. This ensures not only the smooth installation and running of RASflow, but also a reproducible analysis independent of the operating system and machine.


	Snakemake as framework

Snakemake is a scalable workflow engine that helps to manage workflows in an easy way. It divides the whole workflow into rules with each rule accomplishing one step of the workflow. The input of one rule is the output from the rule corresponding to the previous step, making the dataflow easy to track. Thanks to this logic, the whole workflow becomes highly modular, so users can easily expand the workflow or replace part of it, also for complicated workflows.
RASflow organizes the rules carrying out one big step of the workflow in one file (with extension.rules). All the files are then integrated into one main file (main.py). For the users who are satisfied with RASflow’s default setting, they can manage the workflow simply through the configuration file to tell RASflow which pipeline and which tools they want to use. Advanced users may change the settings and parameters in the.rules files and may also substitute tools for example to try out new methods as they are published.


	Transciptome and genome as reference

RASflow allows users to supply their own genomic or transcriptomic reference. This enables users to study expression in species where no public reference is available or the users have alternative references that they wish to utilize. It should be noted that if one aims for transcript-level analysis, a transcriptome should be used as reference.
But some analyses other than DEA require the reads to be mapped to a genome and gene-level DEA is more robust and experimentally actionable, so RASflow still provides the traditional workflow of genome alignment and DEA based on gene counts.


	Comparison with other tools

We compared RASflow to other existing workflows as shown in Table2. As we can see from the table, some workflows do not include QC steps [15,18]. Some of the workflows are limited to specific organisms typically human or mouse and in some cases other model organisms [12,14–16]. Some of them have functionality only for mapping reads to a reference genome and do not support the use of a transcriptome reference [12,14,15,17]. ARMOR includes both genome and transcriptome as mapping reference but does not support genome-based quantification of expression and subsequent DEA.
Considering hardware requirement, BioJupies is marked as “low" because it is a web application and the compute capacity is offered on the server side. The workflows marked with “high" use STAR for genome alignment which requires about 27GB of RAM to align reads to the human genome. hppRNA and RNACocktail support both STAR and other aligners which require comparably low RAM, such as HISAT2 which is used in RASflow. Tests performed show that RASflow can be used to run human genome alignment smoothly on a personal computer with only 8GB of RAM.
As for workflow installation, RASflow, UTAP, ARMOR, and VIPER all use Conda to create a virtual environment and to install the required software, making workflow installation easy and robust. hppRNA provides scripts to automatically install all the required software but as it is not done through the use of a virtual environment, some software may conflict with software already installed on the machine. The aRNApipe and RNACocktail workflows require the user to install all the software manually which is time-consuming and can also easily lead to version conflicts.
After installation, executing the workflow can also present challenges. In order to use the aRNApipe and RNACocktail workflows on their own data, the user needs to know programming very well. The hppRNA workflow comes with a very detailed and useful manual for the user to follow which helps a lot. The UTAP and BioJupies workflows both provide graphical user interfaces and can be used without any programming skills. While the remaining workflows do not provide graphical interfaces, they use Snakemake to manage all the steps in the workflow, making them easy to use also for those with limited programming skills.


	Extension of RASflow

Thanks to the high modularity of RASflow, it is very easy to exchange the tools applied in RASflow with other tools if they are more appropriate for specific research interest or they are newly developed. Thanks to the feedback from users, we have already added the htseq-count tool for feature counting and the DESeq2 tool for DEA as extra options since the first version of RASflow. Advanced users can also do this by themselves without much effort. We welcome any feedback and contribution through GitHub page to improve RASflow.
RASflow can also be extended to realize other functions, such as Single Nucleotide Variant (SNV) detection, pathway analysis, and so on.
