nf-rnaSeqCount: A Nextflow pipeline for obtaining raw read counts from RNA-seq data


	PIPELINE IMPLEMENTATION AND WORKFLOW

The purpose of the pipeline is to make the task of producing raw read counts for performing differential expression analysis easier, especially for other researchers with little or no knowledge of bioinformatics. The pipeline also needs to be portable and reproducible in order to allow scaling to different computational platforms when large or small datasets are being analysed.


	nf-rnaSeqCount Implementation

Nextflow (Di Tommaso et al., 2017) and Singularity (Kurtzer et al., 2017) were used to implement the pipeline into a portable and reproducible pipeline. Nextflow is a Groovy-based domain-specific language (DSL) specifically designed for bioinformaticists with strong programming knowledge to solve many of the challenges of the inability to reproduce data analysis. Some of these challenges are due to computational platform variations, software and database management, complexity of pipelines, intermediate file handling and lack of good practice (Di Tommaso et al., 2017). The advantages of Nextflow as a workflow management system are that it can handle input and output as channels between processes and reduce the need of having to create intermediate directories to store intermediate results. Variables can also be declared dynamically with no need to explicitly name files, and only the output that is required can be saved to files in each analysis step.
Nextflow also has a number of features that promote pipeline reproducibility and portability. It supports Docker[1] and Singularity[2], the two most used containerisation softwares in the bioinformatics community; it integrates/supports the popular VCS GitHub[3] for sharing of code, and version management; and it allows for scaling of computational pipelines on HPC and cloud systems by providing out of the box scheduler support for Sun Grid Engine (SGE), PBS/Torque, Platform Load Sharing Facility (LSF), Simple Linux Utility for Resource Management (SLURM), HTCondor and Amazon Web Services (AWS) (Di Tommaso et al., 2017).
To facilitate reproducibility and portability of the pipeline, Docker containers were created for each of the processes applications in the pipeline and hosted on DockerHub[4] to use with Singularity when executing the pipeline. Singularity is a lightweight platform for building and running containers that is gaining popularity in the bioinformatics community, especially for performing analysis on a large scale. It uses an image format that is supported across different versions of the C library and kernels, and gives users the ability to encapsulate their pipelines, all required applications, dependencies and base OS environment into a single file that can be locked, copied, shared and archived (Kurtzer et al., 2017). Docker containers hosted on DockerHub can be downloaded and converted into Singularity image format (SIF). Such image files have standard Linux/UNIX file permission and cannot be modified (not even by the host OS), thus can be used with confidence that nothing within the image has changed. Each SIF contains the necessary software required by each process to run. This removes the need to install all the software tools used for these analyses.


	nf-rnaSeqCount Workflow

The nf-rnaSeqCount pipeline depends on Nextflow and Singularity to run. These two softwares must be installed in order for the pipeline to be executed. The input for the nf-rnaSeqCount pipeline are FASTQ files (both paired- and single-ended) for the RNA-seq data to be analysed, a reference genome (in FASTA format) and an annotation file (in GTF format) for the reference genome. These can be passed as command-line arguments during pipeline execution or specified in a configuration file (e.g. main.conf inFigure 2) that can also be passed to the pipeline during execution. Other parameters of the pipeline can be specified in a similar way. Unlike most pipelines that are executed in one go, the nf-rnaSeqCount is a modular pipeline that can be executed in multiple stages (main.nf inFigure 2), allowing the users to interact with the results at each step of the pipeline.
The different modules of the pipeline can be grouped into 4 steps, and each module (specified using the –mode argument) calls the required process in the workflow. The 4 workflow steps and their modules are as follows: (1) Data Preparation:prep.Containers and prep.Indexes; (2) Quality Control:run.ReadQC and run.ReadTrimming; (3) Alignment and Quantification:run.ReadAlignment and run.ReadCounting; and (4) MultiQC:run.MultiQC. The nf-rnaSeqCount pipeline can be obtained using the following command:
The help menu for the pipeline can be accessed with the following command:


	Data Preparation

Data preparation is mandatory at the first step of the workflow. The first process in this step, run_DownloadContainers, downloads all the required workflow containers with the required software for executing the pipeline from DockerHub and converts them to Singularity format. This step is crucial as all processes in the pipeline depend on the applications that are packaged in these containers:
Once the Singularity containers have been downloaded, the run_GenerateSTARIndex and run_GenerateBowtie2Index processes will index the reference genome (for aligning the RNA-seq reads and quantifying the abundance of the identified genomic features) using STAR (Dobin et al., 2013) and Bowtie (Langmead et al., 2009), respectively. Before indexing, the location for the reference genome (FASTA), annotation file (GTF), input FASTQ files and output directory can be provided in a configuration file (as inFigure 2), and passed to the Nextflow command using the -c option. However, the files can also be passed as command-line arguments when executing the pipeline. The remainder of the pipeline assumes that all the required files are provided in a configuration file called main.conf. To index the reference genome using STAR and Bowtie, the following commands can be used:


	Quality Control

The nf-rnaSeqCount pipeline incorporates FastQC (Andrews, 2010) and Trimmomatic (Bolger et al., 2014) for pre-processing of reads. This “quality control” (QC) step of the nf-rnaSeqCount workflow is optional; however, it is very useful to first assess the initial quality of the reads so that poor quality reads and contaminating “adapter” sequences can be removed. In this step, the run_QualityChecks process uses FastQC to assess the quality of the RNA-seq reads. The run_ReadTrimming process, which uses Trimmomatic, is then used to remove technical sequences and poor quality bases from the data. To assess the quality of the RNA-seq reads, the following command can be executed:
Once the quality of the reads has been assessed, Trimmomatic can be used to remove low quality bases and adapter sequences. The –trim option can be used to pass Trimmomatic arguments to the pipeline:


	Alignment and Quantification

The alignment and quantification is the main step of the nf-rnaSeqCount workflow. In this step, the RNA-seq reads are aligned to the reference genome (indexed in the “Data Preparation” step) using STAR (Dobin et al., 2013) in the run_ReadAlignment process. The resulting BAM files are then used to quantify the abundance of the identified genes using both featureCounts (Liao et al., 2014) (run_FeatureCounts process) and htseq-count (Anders et al., 2015) (run_HTSeqCount process). To align the reads to the reference genome and quantify the abundance of the genomic features identified, the following commands can be used:


	MultiQC

The MultiQC step is optional. In this step, the run_MultiQC process uses MultiQC (Ewels et al., 2016) to collect all the statistics from all the programs that were executed in the workflow, and give a summary of all statistics in an HTML file. To obtain the summary statistics of the workflow, the following command can be used:


	nf-rnaSeqCount Pipeline Output

The output directory for the nf-rnaSeqCount pipeline contains a number of folders:
 ** n number of folders corresponding to each of the samples that were processed by the pipeline. These folders contain general statistics on mapping using STAR.
** featureCounts folder containing read counts matrix (gene_counts_final.txt) for htseq-count. This file can be used for differential expression analysis.
** htseqCounts folder containing read counts matrix (gene_counts_final.txt) for featureCounts. This file can be used for differential expression analysis.
** report_QC folder containing MultiQC QC reports in HTML format. This file can be used to assess the quality of read mapping and gene quantification.
** report_workflow folder containing pipeline execution reports. These files can be used to trace the execution of the pipeline and check other metadata in order to assign resources correctly to the processes.


	Testing

The RNA-seq data of black South African patients with SSc were used to test and validate the usefulness of the nf-rnaSeqCount pipeline. This transcriptome data was from a study byFrost et al. (2019), conducted with ethics approval of the Human Research Ethics Committee (HREC [Medical]) of the University of the Witwatersrand (certificate number M120512). The nf-rnaSeqCount pipeline was initially developed and tested on the University of the Witwatersrand (Wits) Computing cluster using SLURM and PBS job schedulers. The pipeline also has been successfully tested on the University of Cape Town (UCT) eResearch HPC[5] and on Amazon’s AWS[6] using RNA-seq data for this study. On the UCT’s eResearch HPC, which has SLURM as the job scheduler, the same computing requirements as with the Wits Computing cluster were used.
For AWS execution of the pipeline, the Nextflow supported Amazon Machine Image (AMI), ami-4b7daa32, was used to deploy an Amazon Elastic Block Store (EBS) of 1000GB using the Elastic Compute Cloud (EC2), m4.10xlarge, with 40 virtual CPUs and 160 GB of memory. All AWS analyses were performed on the European (Ireland) region since the Nextflow AMI was only available for this region. The pipeline was executed using the standard computing environment (no job scheduler) on the EC2. Estimating the cost of running the analysis on the AWS (February 2019 pricing), the m4.10xlarge cost $2.22 per hour[7]), the standard general purpose solid-state drive (SSD) costs $0.11 per GB-month[8]. Given that the analysis took approximately 4 hours, the total approximated cost for running the nf-rnaSeqCount pipeline on the SSc data was:
 ($2.2hour×4hours)+($0.11∕GBmonth×1000GB×4hours740hours)=$9.39


	Benchmarking

The nf-rnaSeqCount pipeline was further benchmarked for time, memory and CPU usage against the popular Rsubread package (Liao et al., 2019). Benchmarking of the nf-rnaSeqCount pipeline against the Rsubread package was to determine the speed at which both tools complete different tasks, computational resource requirements, scalability on large datasets, ability to perform tasks in parallel as well as usability. The Rsubread package was chosen as it is a comprehensive tool and performs similar RNA-seq analysis workflow (read alignment and read counting) as our nf-rnaSeqCount pipeline. The RNA-seq data used for benchmarking was downloaded from the Gene Expression Omnibus (GEO) with an accession GSE111073[9]. The data consisted of 21 RNA-seq breast cancer samples from walnut-consuming patients (10 samples) and control group (11 samples) (Hardman et al., 2019a,2019b). Benchmarking of the two tools was carried out on the Wits Computing cluster, with 48 GB of memory and 12 CPUs allocated to each task in the analysis workflow, i.e., reference genome indexing, read alignment and read counting.
Figure 3 summarises the benchmarking results between nf-rnaSeqCount and the Rsubread package in terms of their time, memory and CPU usage when performing genome indexing, read alignment and read counting. The nf-rnaSeqCount pipeline was able to distribute tasks across multiple nodes on the Wits Computing cluster, i.e., run jobs in parallel, in addition to multi-threading, whereas the Rsubread applications were multi-threaded only on a single node.
When it comes to reference genome indexing, both the nf-rnaSeqCount (using STAR and Bowtie) and Rsubread (using index) performed equally in terms of time usage, with each tool completing the task in 66 and 71 minutes, respectively. The nf-rnaSeqCount utilised more resources for indexing (35 GB of memory and 825% of allocated CPUs) compared to Rsubread (16 GB and 99% of allocated CPUs). For the alignment of the RNA-seq reads to the reference genome, nf-rnaSeqCount (using STAR) outperformed Rsubread (using align), with the alignment tasks completed 31 and 215 minutes, respectively. nf-rnaSeqCount completed the alignment tasks using 30 GB of memory and 648% of the allocated CPUs, whilst Rsubread completed the alignment tasks using 18 GB of memory and 1088% of the allocated CPUs.
Finally, for the read counting, the nf-rnaSeqCount (using htseq-count and featureCounts) was outperformed by Rsubread (using featureCounts), with the read counting tasks completed in 235 and 6 minutes, respectively. nf-rnaSeqCount completed the read counting tasks using 3 GB of memory and 810% of the allocated CPUs, whilst Rsubread completed the alignment tasks using 3 GB of memory and 870% of the allocated CPUs. The huge difference with time usage seen between nf-rnaSeqCount and the Rsubread when it comes to read counting can mainly attributed to htseq-count. Unlike other tools in both workflows, htseq-count cannot be multi-threaded, thus drastically reducing performance and increasing amount of time it takes to complete the read counting tasks for the nf-rnaSeqCount workflow.
