GEMmaker: process massive RNA-seq datasets on heterogeneous computational infrastructure


	Implementation

GEMmaker uses Nextflow and is a combination of Groovy scripts for interfacing with Nextflow, Python scripts for wrangling intermediate data, and Bash scripts for execution of each software tool in the workflow. Nextflow was selected as the framework because it is widely used, is well supported, has a robust community of workflow creators in the life sciences, supports multiple computing platforms and supports containerization systems such as Docker and Singularity. Nextflow allows for execution of workflows from a command-line interface, which is common with most HPC platforms. These attributes make GEMmaker relatively easy to use. The following is an example command-line for execution of GEMmaker on a local machine using Singularity (for containerization), quantification using Salmon, and a file containing a list of SRA run IDs for Arabidopsis thaliana Illumina datasets:
GEMmaker adopts the nf-core recommendations and standards to provide consistency in functionality with other popular nf-core workflows.
GEMmaker uses a variety of software tools for gene expression-level quantification and quality control that can be selected by the user. These software are listed in Table1 and the step-by-step flow of the workflow using these tools is shown in Fig.1. There are four primary paths for gene expression quantification within GEMmaker: STAR, HISAT2, Salmon and kallisto. The STAR and HISAT2 paths include read trimming via Trimmomatic, SAMtools for storing alignments and Stringtie for quantification. Salmon and kallisto do not require those steps. All paths provide a MultiQC report to help end-users explore the quality of results from the workflow.
As mentioned previously, GEMmaker is designed to scale. It can scale to process increasingly larger experiments (or large numbers of samples from public repositories) that can include hundreds to thousands of RNA-seq samples without intermediate files overrunning available compute storage. It supports execution on a large variety of computational platforms such that researchers can take full advantage of the compute facilities available to them including local desktop workstations, institutional clusters, national-funded resources such as XSEDE [34], the Pacific Research Platform [31], and commercial clouds.
To ensure storage requirements are not exceeded, GEMmaker moves input FASTQ files between three folders: “stage”, “processing” and “done”. Initially all samples are placed in the “stage” folder and GEMmaker will move into the “processing” folder as many samples as there are CPUs available. The user sets the number of CPUs that the workflow can use with the –max_cpus argument. On a compute cluster, this could be tens to hundreds. Nextflow is then instructed to automatically begin processing any samples that appear in the “processing” folder. As usual, Nextflow will process samples in parallel, using all CPUs, by first executing the first step for all samples, then the second for all samples, and so forth. However, because GEMmaker limits the number of samples to the number of CPUs, when a sample completes a step, it will move to the next step because Nextflow does not see any samples waiting. When a sample fully completes all steps, GEMmaker will then move the sample from the “processing” folder into the “done” folder and will move one sample from the “stage” folder into the “processing” folder. Nextflow sees this new sample in the “processing” folder and immediately begins processing that sample through each step. There is no lag between the time one sample finishes, and another begins and Nextflow should keep all CPUs consistently busy processing samples in parallel. As the workflow progresses for each sample, GEMmaker will cleanup unwanted intermediate files. This ensures space is cleaned before more samples begin processing. If the user specifies a –max_cpu size that does not exceed the resources of the computational platform, then GEMmaker can successfully process hundreds to thousands of samples.
While GEMmaker, by default, cleans all intermediate files, there are arguments that can be provided, as described in the online documentation, to control which intermediate files are removed. Users can keep downloaded SRA and FASTQ files, trimmed FASTQ files, SAM and BAM alignment files, and kallisto and Salmon pseudoalignment files. If any of these files are needed for downstream analyses they can be retained.
The speed at which the samples are processed depends on the number of processors and available memory of the compute nodes. Users with limited CPUs or RAM may need more time to process all samples. If users set the –max_cpus setting higher than storage will support, then GEMmaker may not be able to cleanup intermediate files before overrunning storage. It is difficult to recommend a value which maximizes the trade-off between the number of CPUs and storage requirements because RNA-seq samples and genomic reference sequences can be dramatically different in size, resulting in different sized intermediate files. However, using averaged values from the sample data reported here, we provide a rough recommendation that users have about 30 times the storage of an average sample size, times the number of CPUs when using HISAT2. For an average sample size of 2.5GBs this would require 75 GB per CPU. For kallisto and Salmon we recommend 7 times the storage of an average sample per CPU (17GBs).
To ensure portability between HPC systems, GEMmaker makes use of containerized software. This alleviate the burden of installing the same software versions on every computational system on which it is run. All GEMmaker dependent software are provided in the GEMmaker docker image and their versions are listed in Table1. GEMmaker retrieves this Docker image from Docker Hub the first time it is run—users need not install any software other than Nextflow and a containerization software (Singularity or Docker). Thus, a GEMmaker workflow can be performed on any computational system and results will be reproducible and consistent.
